perm filename V2G.TEX[TEX,DEK] blob
sn#362432 filedate 1978-06-22 generic text, type C, neo UTF8
COMMENT ⊗ VALID 00021 PAGES
C REC PAGE DESCRIPTION
C00001 00001
C00004 00002 %folio 262 galley 1 (C) Addison-Wesley 1978 -
C00015 00003 %folio 265 galley 2 (C) Addison-Wesley 1978 -
C00023 00004 %folio 269 galley 3 (C) Addison-Wesley 1978 -
C00032 00005 %folio 273 galley 4 (C) Addison-Wesley 1978 -
C00047 00006 %folio 276 galley 5 (C) Addison-Wesley 1978 -
C00059 00007 %folio 282 galley 6 (C) Addison-Wesley 1978 -
C00071 00008 %folio 285 galley 7 WARNING: Much tape unreadable! (C) Addison-Wesley 1978 -
C00080 00009 %folio 291 galley 8 (C) Addison-Wesley 1978
C00091 00010 %folio 293 galley 9 (C) Addison-Wesley 1978
C00099 00011 %folio 301 galley 10 (C) Addison-Wesley 1978
C00112 00012 %folio 300 galley 11 (C) Addison-Wesley 1978
C00126 00013 %folio 306 galley 12 (C) Addison-Wesley 1978
C00143 00014 %folio 313 galley 13 (C) Addison-Wesley 1978
C00152 00015 %folio 317 galley 14 (C) Addison-Wesley 1978
C00167 00016 %folio 322 galley 15-16 WARNING: End of 15 and beginning of 16 were unreadable. (C) Addison-Wesley 1978
C00180 00017 %folio 328 galley 17 (C) Addison-Wesley 1978
C00192 00018 %folio 330 galley 18 (C) Addison-Wesley 1978
C00201 00019 %folio 333 galley 19 (C) Addison-Wesley 1978
C00210 00020 %folio 336 galley 20 (C) Addison-Wesley 1978
C00231 00021
C00232 ENDMK
C⊗;
%folio 262 galley 1 (C) Addison-Wesley 1978 -
Most floating-point routines now in use deal almost
entirely with normalized numbers: inputs to the routines are
assumed to be normalized, and the outputs are always normalized.
Under these conventions we lose the ability to represent a few
numbers of very small magnitude---for example, the value (0,
.00000001) can't be normalized without producing a negative
exponent---but we gain in speed, uniformity, and the ability
to give relatively simple bounds on the relative error in our
computations. (Unnormalized floating-point arithmetic is discussed
in Section 4.2.2.)
Let us now study the normalized floating-point
arithmetic in detail. At the same time we can consider the construction
of subroutines for these operations, assuming that we have
a computer without built-in floating-point hardware.
Machine-language subroutines for floating-point
arithmetic are usually written in a very machine-dependent manner,
using many of the wildest idiosyncrasies of the computer at
hand, and so we find that floating-point addition subroutines
for two different machines usually bear little superficial resemblance
to each other. Yet a careful study of numerous subroutines
for both binary and decimal computers reveals that these programs
actually have quite a lot in common, and it is possible to discuss
the topics in a machine-independent way.
The first (and by far the most difficult!) algorithm we shall
discuss in this section is a procedure for floating-point addition:
$$(e↓u, f↓u) \oplus (e↓v, f↓v) = (e↓w, f↓w).\eqno (6)$$
{\sl Note:} Since floating-point arithmetic is
inherently approximate, not exact, we will use the symbols
$$\oplus ,\quad \ominus ,\quad \otimes ,\quad \odiv$$
to denote floating-point addition, subtraction,
multiplication, and division, respectively, in order to distinguish
the approximate operations from the true ones.
|qleft \12skip \quad The basic idea involved in floating-point
addition is fairly simple: Assuming that $e↓u ≥ e↓v$, we take
$e↓w = e↓u, f↓w = f↓u + f↓v/b↑{e↓u-e↓v}$ (thereby aligning the
radix points for a meaningful addition), and normalize the result.
Several situations can arise which make this process nontrivial,
and the following algorithm explains the method more precisely.
\algbegin Algorithm A (Floating-point addition).
Given base $b$, excess $q$, $p$-digit, normalized floating-point
numbers $u = (e↓u, f↓u)$ and $v = (e↓v, f↓v)$, this algorithm
forms the sum $w = u \oplus v$. The same procedure may be used
for floating-point subtraction, if $-v$ is substituted for $v$.
|qleft \6skip |newtype {\bf Fig. 2. \xskip} Floating-point addition.
\algstep A1. [Unpack.] Separate the
exponent and fraction parts of the representations
of $u$ and $v$.
\algstep A2. [Assume $e↓u ≥ e↓v.]$ If $e↓u < e↓v$, interchange
$u$ and $v$. (In many cases, it is best to combine step A2 with
step A1 or with some of the later steps.)
\algstep A3. [Set $e↓w.]$ Set $e↓w ← e↓u$.
\algstep A4. [Test $e↓u - e↓v.]$ If $e↓u - e↓v ≥ p + 2$ (large
difference in exponents), set $f↓w ← f↓u$ and go to step A7.
(Actually, since we are assuming that $u$ is normalized, we
could terminate the algorithm; but it is often useful to be
able to normalize a possibly unnormalized number by adding zero
to it.)
\algstep A5. [Scale right.] Shift $f↓v$ to the right $e↓u -
e↓v$ places; i.e., divide it by $b↑{e↓u-e↓v}$. {\sl Note:} This
will be a shift of up to $p + 1$ places, and the next step (which
adds $f↓u$ to $f↓v$) thereby requires an accumulator capable
of holding $2p + 1$ base $b$ digits to the right of the radix
point. If such a large accumulator is not available, it is possible
to shorten the requirement to $p + 2$ or $p + 3$ places if proper
precautions are taken; the details are given in exercise 5.
\algstep A6. [Add.] Set $f↓w ← f↓u + f↓v$.
\algstep A7. [Normalize.] (At this point $(e↓w, f↓w)$ represents
the sum of $u$ and $v$, but $|f↓w|$ may have more than $p$ digits,
and it may be greater than unity or less than $1/b$.) Perform
Algorithm N below, to normalize and round $(e↓w, f↓w)$ into
the final answer.
\algbegin Algorithm N (Normalization).
A ``raw exponent'' $e$ and a ``raw fraction'' $f$ are converted
to normalized form, rounding if necessary to $p$ digits. This
algorithm assumes that $|f| < b$.
|qleft \6skip |newtype {\bf Fig. 3. \xskip} Normalization.
\algstep N1. [Test $f$.] If $|f| ≥ 1$ (``fraction
overflow''), go to step N4. If $f = 0$, set $e$ to its lowest
possible value and go to step N7.
\algstep N2. [Is $f$ normalized?] If $|f| ≥ 1/b$, go to step
N5.
\algstep N3. [Scale left.] Shift $f$ to the left by one digit
position (i.e., multiply it by $b)$, and decrease $e$ by 1.
Return to step N2.
\algstep N4. [Scale right.] Shift $f$ to the right by one digit
position (i.e., divide it by $b)$, and increase $e$ by 1.
\algstep N5. [Round.] Round $f$ to $p$ places. (We take this
to mean that $f$ is changed to the nearest multiple of $b↑{-p}$.
It is possible that $(b↑pf)\mod 1 = {1\over 2}$, so that there
are {\sl two} nearest multiples; if $b$ is even, we choose that
one which makes $b↑pf + {1\over 2}b$ odd. Further discussion
of rounding appears in Section 4.2.2.) It is important to note
that this rounding operation can make $|f| = 1$ (``rounding
overflow''); in such a case, return to step N4.
\algstep N6. [Check {\sl e.]} If $e$ is too large, i.e., larger
than its allowed range, an {\sl exponent overflow} condition
is sensed. If $e$ is too small, an {\sl exponent underflow}
condition is sensed. (See the discussion below; since the result
cannot be expressed as a normalized floating-point number in
the required range, special action is necessary.)
\algstep N7. [Pack.] Put $e$ and $f$ together into the desired
output format.\quad\blackslug
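For readers who like to experiment, here is a small Python sketch of
Algorithms A and N (an illustration only, not part of the \MIX\ programs that
follow); the representation of a number as a pair (e, f) with an exact
Fraction for f, and the parameter choices b = 10, q = 50, p = 8, are
assumptions made just for this sketch.

from fractions import Fraction

b, q, p = 10, 50, 8                    # base, excess, number of digits

def normalize(e, f):
    """Algorithm N: normalize and round a raw (e, f) with |f| < b."""
    if f == 0:
        return (0, Fraction(0))        # N1: zero fraction, lowest exponent
    while abs(f) >= 1:                 # N1/N4: fraction overflow, scale right
        f /= b; e += 1
    while abs(f) < Fraction(1, b):     # N2/N3: scale left
        f *= b; e -= 1
    m, r = divmod(f * b**p, 1)         # N5: round to a multiple of b**-p;
    if r > Fraction(1, 2) or (r == Fraction(1, 2) and (m + b // 2) % 2 == 0):
        m += 1                         #     a tie makes m + b/2 odd, as in the text
    f = Fraction(m, b**p)
    if abs(f) == 1:                    # rounding overflow: scale right again
        f /= b; e += 1
    assert 0 <= e < 2 * q, "exponent underflow or overflow (N6)"
    return (e, f)                      # N7: pack

def fadd(u, v):
    """Algorithm A: the sum of two normalized numbers."""
    (eu, fu), (ev, fv) = u, v
    if eu < ev:                        # A2: assume eu >= ev
        (eu, fu), (ev, fv) = (ev, fv), (eu, fu)
    if eu - ev >= p + 2:               # A4: exponents differ too much
        return normalize(eu, fu)
    return normalize(eu, fu + fv / b**(eu - ev))   # A3, A5, A6, A7

print(fadd((51, Fraction("0.10000000")), (50, Fraction("0.99999999"))))
# -> (51, Fraction(1, 5)), i.e., +.20000000 times 10^1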
%folio 265 galley 2 (C) Addison-Wesley 1978 -
\yyskip The following \MIX\ subroutines, for addition and subtraction
of numbers having the form (4), show how Algorithms A and N
can be expressed as computer programs. The subroutines below
are designed to take one input $u$ from symbolic location ACC,
and the other input $v$ comes from register A upon entrance
to the subroutine. The output $w$ appears both in register A
and location ACC. Thus, a fixed-point coding sequence
$$\hjust{\tt LDA A;\quad ADD B;\quad SUB C;\quad STA D;}\eqno (7)$$
would correspond to the floating-point
coding sequence
$$\hjust{\tt LDA A,\quad STA ACC;\quad LDA B,\quad JMP FADD;\quad
LDA C,\quad JMP FSUB;\quad STA D.}\eqno (8)$$
\algbegin Program A (Addition, subtraction,
and normalization). The following program is a subroutine
for Algorithm A, and it is also designed so that the normalization
portion can be used by other subroutines which appear later
in this section. In this program and in many other programs
throughout this chapter, OFLO stands for a subroutine which
prints out a message to the effect that \MIX's overflow toggle
was unexpectedly found to be ``on.'' The byte size $b$ is assumed
to be a multiple of 4. The normalization routine NORM assumes
that rI2 = $e$ and rAX = $f$, where rA = 0 implies rX = 0 and
rI2 < $b$.
\yyskip
\mixfour{01⊗EXP⊗EQU⊗1:1⊗Definition of exponent field.\cr
02⊗FSUB⊗STA⊗TEMP⊗Floating-point subtraction subroutine:\cr
03⊗⊗LDAN⊗TEMP⊗Change sign of operand.\cr
04⊗FADD⊗STJ⊗EXITF⊗Floating-point addition subroutine:\cr
05⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
06⊗⊗STA⊗TEMP⊗$\hjust{\tt TEMP} ← v$.\cr
07⊗⊗LDX⊗ACC⊗$\rX ← u$.\cr
08⊗⊗CMPA⊗ACC(EXP)⊗{\sl Steps A1, A2, A3 are combined here:}\cr
09⊗⊗JGE⊗1F⊗Jump if $e↓v ≥ e↓u$.\cr
10⊗⊗STX⊗FU(0:4)⊗$\hjust{\tt FU} ← \pm\,f\,f\,f\,f\,0$.\cr
11⊗⊗LD2⊗ACC(EXP)⊗$\rI2 ← e↓w$.\cr
12⊗⊗STA⊗FV(0:4)\cr
13⊗⊗LD1N⊗TEMP(EXP)⊗$\rI1 ← -e↓v$.\cr
14⊗⊗JMP⊗4F\cr
15⊗1H⊗STA⊗FU(0:4)⊗$\hjust{\tt FU} ← \pm\,f\,f\,f\,f\,0$ ($u$, $v$ interchanged).\cr
16⊗⊗LD2⊗TEMP(EXP)⊗$\rI2 ← e↓w$.\cr
17⊗⊗STX⊗FV(0:4)\cr
18⊗⊗LD1N⊗ACC(EXP)⊗$\rI1 ← -e↓v$.\cr
19⊗4H⊗INC1⊗0,2⊗$\rI1 ← e↓u - e↓v$. (Step A4 unnecessary.)\cr
20⊗5H⊗LDA⊗FV⊗{\sl A5. Scale right.}\cr
21⊗⊗ENTX⊗0⊗Clear rX.\cr
22⊗⊗SRAX⊗0,1⊗Shift right $e↓u - e↓v$ places.\cr
23⊗6H⊗ADD⊗FU⊗{\sl A6. Add.}\cr
24⊗⊗JOV⊗N4⊗{\sl A7. Normalize.} Jump if fraction overflow.\cr
25⊗⊗JXZ⊗NORM⊗Easy case?\cr
26⊗⊗CMPA⊗=0=(1:1)⊗Is $f$ normalized?\cr
27⊗⊗JNE⊗N5⊗If so, round it.\cr
28⊗⊗SRC⊗5⊗$|\rX| ↔ |\rA|$.\cr
29⊗⊗DECX⊗1⊗(rX is positive.)\cr
30⊗⊗STA⊗TEMP⊗(Operands had opposite signs,\cr
31⊗⊗STA⊗HALF(0:0)⊗registers must be adjusted\cr
32⊗⊗LDAN⊗TEMP⊗before rounding and normalization.)\cr
33⊗⊗ADD⊗HALF\cr
34⊗⊗ADD⊗HALF⊗Complement least significant half.\cr
35⊗⊗SRC⊗4⊗Jump into normalization routine.\cr
36⊗⊗JMP⊗N3A\cr
37⊗HALF⊗CON⊗1//2⊗One half the word size. (Sign varies.)\cr
38⊗FU⊗CON⊗0⊗Fraction part $f↓u$.\cr
39⊗FV⊗CON⊗0⊗Fraction part $f↓v$.\cr
40⊗NORM⊗JAZ⊗ZRO⊗{\sl N1. Test f.}\cr
41⊗N2⊗CMPA⊗=0=(1:1)⊗{\sl N2. Is f normalized?}\cr
42⊗⊗JNE⊗N5⊗If leading byte nonzero, to N5.\cr
43⊗N3⊗SLAX⊗1⊗{\sl N3. Scale left.}\cr
44⊗N3A⊗DEC2⊗1⊗Decrease $e$ by 1.\cr
45⊗⊗JMP⊗N2⊗Return to N2.\cr
46⊗N4⊗ENTX⊗1⊗{\sl N4. Scale right.}\cr
47⊗⊗SRC⊗1⊗Shift right, insert ``1'' with proper sign.\cr
48⊗⊗INC2⊗1⊗Increase $e$ by 1.\cr
49⊗N5⊗CMPA⊗=BYTE/2=(5:5)⊗{\sl N5. Round.}\cr
50⊗⊗JL⊗N6⊗Is $|\hjust{tail}| < {1\over 2}b$?\cr
51⊗⊗JG⊗5F\cr
52⊗⊗JXNZ⊗5F⊗Is $|\hjust{tail}| > {1\over 2}b$?\cr
53⊗⊗STA⊗TEMP⊗$|\hjust{tail}| = {1\over 2}b$; round to odd.\cr
54⊗⊗LDX⊗TEMP(4:4)\cr
55⊗⊗JXO⊗N6⊗To N6 if rX is odd.\cr
56⊗5H⊗STA⊗*+1(0:0)⊗Store sign of rA.\cr
57⊗⊗INCA⊗BYTE⊗Add $b↑{-4}$ to $|f|$. (Sign varies.)\cr
58⊗⊗JOV⊗N4⊗Check for rounding overflow.\cr
59⊗N6⊗J2N⊗EXPUN⊗{\sl N6. Check e.} Underflow if $e < 0$.\cr
60⊗N7⊗ENTX⊗0,2⊗{\sl N7. Pack.} $\rX ← e$.\cr
61⊗⊗SRC⊗1\cr
62⊗ZRO⊗DEC2⊗BYTE⊗$\rI2 ← e - b$.\cr
63⊗8H⊗STA⊗ACC\cr
64⊗EXITF⊗J2N⊗*⊗Exit, unless $e ≥ b$.\cr
65⊗EXPOV⊗HLT⊗2⊗Exponent overflow detected\cr
66⊗EXPUN⊗HLT⊗1⊗Exponent underflow detected\cr
67⊗ACC⊗CON⊗0⊗Floating-point accumulator\quad\blackslug\cr}
%folio 269 galley 3 (C) Addison-Wesley 1978 -
\yyskip The rather long section of code from lines 25 to 37
is needed because \MIX\ has only a 5-byte accumulator for adding
signed numbers while in general $2p + 1 = 9$ places of accuracy
are required by Algorithm A. The program could be shortened
to about half its present length if we were willing to sacrifice
a little bit of its accuracy, but we shall see in the next section
that full accuracy is important. Line 55 uses a nonstandard
\MIX\ instruction defined in Section 4.5.2. The running time for
floating-point addition and subtraction depends on several factors
which are analyzed in Section 4.2.4.
Now let us consider multiplication
and division, which are simpler than addition, and which are
somewhat similar to each other.
\algbegin Algorithm M (Floating-point multiplication
or division). Given base $b$, excess $q$, $p$-digit, normalized
floating-point numbers $u = (e↓u, f↓u)$ and $v = (e↓v, f↓v)$,
this algorithm forms the product $w = u \otimes v$ or the quotient
$w = u \odiv v$.
\algstep M1. [Unpack.] Separate the exponent
and fraction parts of the representations of $u$ and $v$. (Sometimes
it is convenient, but not necessary, to test the operands for
zero during this step.)
\algstep M2. [Operate.] Set
$$\vcenter{\halign{\lft{$#$}\quad⊗\lft{$#$}\quad⊗\lft{#}\cr
e↓w ← e↓u + e↓v - q,⊗f↓w ← f↓uf↓v⊗for multiplication;\cr
\noalign{\vskip 3pt}
e↓w ← e↓u - e↓v + q + 1,⊗f↓w ← (b↑{-1}f↓u)/f↓v⊗for division.\cr}}\eqno(9)$$
(Since the input numbers are assumed to
be normalized, it follows that either $f↓w = 0$, or $1/b↑2 ≤
|f↓w| < 1$, or a division-by-zero error has occurred.) If necessary,
the representation of $f↓w$ may be reduced to $p+2$ or $p+3$
digits at this point, as in exercise 5.
\algstep M3. [Normalize.] Perform Algorithm N on
$(e↓w, f↓w)$ to normalize, round, and pack the result. ({\sl
Note:} Normalization is simpler in this case, since scaling
left occurs at most once, and rounding overflow cannot occur
after division.)\quad\blackslug
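As a companion to the sketch given after Algorithm N, the following Python
fragment carries out step M2 in the same (e, f) model; it is again only an
illustration (the abbreviated normalize() here rounds ties with Python's
round(), which for b = 10 happens to agree with the tie-breaking rule of
step N5).

from fractions import Fraction

b, q, p = 10, 50, 8

def normalize(e, f):
    """Algorithm N, abbreviated: scale so that 1/b <= |f| < 1, then round."""
    if f == 0:
        return (0, Fraction(0))
    while abs(f) >= 1:
        f /= b; e += 1
    while abs(f) < Fraction(1, b):
        f *= b; e -= 1
    f = Fraction(round(f * b**p), b**p)
    if abs(f) == 1:                    # rounding overflow (it cannot occur after division)
        f /= b; e += 1
    return (e, f)

def fmul(u, v):
    (eu, fu), (ev, fv) = u, v
    return normalize(eu + ev - q, fu * fv)            # M2 for multiplication

def fdiv(u, v):
    (eu, fu), (ev, fv) = u, v
    if fv == 0:
        raise ZeroDivisionError("division by zero")
    return normalize(eu - ev + q + 1, (fu / b) / fv)  # M2 for division

print(fmul((51, Fraction("0.20000000")), (52, Fraction("0.60000000"))))
print(fdiv((53, Fraction("0.10000000")), (51, Fraction("0.30000000"))))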
\yyskip The following \MIX\ subroutines,
which use the same conventions as Program A above, illustrate
the machine considerations necessary in connection with Algorithm
M.
\algbegin Program M (Floating-point multiplication and division).
\def\\{\noalign{\penalty-200}}
\yyskip
\mixfour{01⊗Q⊗EQU⊗BYTE/2⊗$q$ is half the byte size
02⊗FMUL⊗STJ⊗EXITF⊗Floating-point multiplication subroutine:\cr
03⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
04⊗⊗STA⊗TEMP⊗$\hjust{\tt TEMP} ← v$.\cr
05⊗⊗LDX⊗ACC⊗$\rX ← u$.\cr
06⊗⊗STX⊗FU(0:4)⊗$\hjust{\tt FU} ← \pm\,f\,f\,f\,f\,0$.\cr
07⊗⊗LD1⊗TEMP(EXP)\cr
08⊗⊗LD2⊗ACC(EXP)\cr
09⊗⊗INC2⊗-Q,1⊗$\rI2 ← e↓u + e↓v - q$.\cr
10⊗⊗SLA⊗1\cr
11⊗⊗MUL⊗FU⊗Multiply $f↓u$ times $f↓v$.\cr
12⊗⊗JMP⊗NORM⊗Normalize, round, and exit.\cr
\noalign{\yskip}
14⊗FDIV⊗STJ⊗EXITF⊗Floating-point division subroutine:\cr
15⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
16⊗⊗STA⊗TEMP⊗$\hjust{\tt TEMP} ← v$.\cr
17⊗⊗STA⊗FV(0:4)⊗$\hjust{\tt FV} ← \pm\,f\,f\,f\,f\,0$.\cr
18⊗⊗LD1⊗TEMP(EXP)\cr
19⊗⊗LD2⊗ACC(EXP)\cr
20⊗⊗DEC2⊗-Q,1⊗$\rI2 ← e↓u - e↓v + q$.\cr
21⊗⊗ENTX⊗0\cr
22⊗⊗LDA⊗ACC\cr
23⊗⊗SLA⊗1⊗$\rA ← f↓u$.\cr
24⊗⊗CMPA⊗FV(1:5)\cr
25⊗⊗JL⊗*+3⊗Jump if $|f↓u| < |f↓v|$.\cr
26⊗⊗SRA⊗1⊗Otherwise, scale $f↓u$ right\cr
27⊗⊗INC2⊗1⊗\quad and increase rI2 by 1.\cr
28⊗⊗DIV⊗FV⊗Divide.\cr
29⊗⊗JNOV⊗NORM⊗Normalize, round, and exit.\cr
30⊗DVZRO⊗HLT⊗3⊗Unnormalized or zero divisor.\quad\blackslug\cr
\yyskip The most noteworthy feature of this program
is the provision for division in lines 24--27, which is made
in order to ensure enough accuracy to round the answer. If $|f↓u|
< |f↓v|$, straightforward application of Algorithm M would leave
a result of the form ``$\pm\,0\,f\,f\,f\,f$'' in register A, and
this would not allow a proper rounding without a careful analysis
of the remainder (which appears in register X). So the program
computes $f↓w ← f↓u/f↓v$ in this case, ensuring that $f↓w$ is
either zero or normalized in all cases; rounding can proceed
with five significant bytes, possibly testing whether the remainder
is zero.
We occasionally need to convert between fixed-
and floating-point representations. A ``fix-to-float'' routine
is easily obtained with the help of the normalization algorithm
above; for example, in \MIX\, the following subroutine converts
an integer to floating-point form:
\yskip\mixfour{01⊗FLOT⊗STJ⊗EXITF⊗Assume that rA $= u$, an integer.\cr
02⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
03⊗⊗ENT2⊗Q+5⊗Set raw exponent.\cr
04⊗⊗ENTX⊗0\cr
05⊗⊗JMP⊗NORM⊗Normalize, round, and exit.\quad (10)\cr}
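The corresponding fix-to-float conversion is easy to express in the little
decimal model sketched earlier (again an illustration, with b = 10, q = 50,
p = 8 assumed, and restricted to integers of at most p digits so that no
rounding is needed): give the integer a raw exponent of q + p and normalize,
just as FLOT jumps into NORM.

from fractions import Fraction

def flot(n, b=10, q=50, p=8):
    e, f = q + p, Fraction(n, b**p)    # raw exponent and fraction
    if f == 0:
        return (0, f)
    while abs(f) < Fraction(1, b):     # scale left until normalized
        f *= b; e -= 1
    return (e, f)

print(flot(1978))                      # (54, Fraction(989, 5000)), i.e., +.19780000 times 10^4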
%folio 273 galley 4 (C) Addison-Wesley 1978 -
$$\quad The debugging of floating-point subroutines is usually
a difficult job, since there are so many cases to consider.
Here is a list of common pitfalls which often trap a programmer
or machine designer who is preparing floating-point routines:
$$\quad 1) {\sl Losing the sign.} On many machines (not \MIX\),
shift instructions between registers will affect the sign, and
the shifting operations used in normalizing and scaling numbers
must be carefully analyzed. The sign is also lost frequently
when minus zero is present. (For example, Program A is careful
to retain the sign of register A in lines 30--34. See also exercise
6.)
2) {\sl Failure to treat exponent underflow or
overflow properly.} The size of $e↓w$ should not be checked
until {\sl after} the rounding and normalization, because preliminary
tests may give an erroneous indication. Exponent underflow and
overflow can occur on floating-point addition and subtraction,
not only during multiplication and division; and even though
this is a rather rare occurrence, it must be tested each time.
Enough information should be retained that meaningful corrective
actions are possible after overflow or underflow has occurred.
It has unfortunately become customary in many instances
to ignore exponent underflow and simply to set underflowed results
to zero with no indication of error. This causes a serious loss
of accuracy in most cases (indeed, it is the loss of {\sl all}
the significant digits), and the assumptions underlying floating-point
arithmetic have broken down, so the programmer really must be
told when underflow has occurred. Setting the result to zero
is appropriate only in certain cases when the result is to be
added to a significantly larger quantity. When exponent underflow
is not detected, we find mysterious situations in which $(u
\otimes v) \otimes w$ is zero, but $(u \otimes w) \otimes v$
is not, since $u \otimes v$ results in exponent underflow but
$(u \otimes w) \otimes v$ can be calculated without any exponents
falling out of range. Similarly, we can find positive numbers
$a, b, c, d$, and $y$ such that
$$\vcenter{\halign{\lft{$#$}\quad⊗\lft{$#$}\cr
(a \otimes y \oplus b) \odiv (c \otimes y \oplus d)⊗\approx {2\over 3},\cr
\noalign{\vskip 3pt}
(a \oplus b \odiv y) \odiv (c \oplus d \odiv y)⊗= 1\cr}}\eqno(11)$$
if exponent underflow is not detected. (See exercise
9.) Even though floating-point routines are not precisely accurate,
such a disparity as (11) is quite unexpected when $a$, $b$, $c$,
$d$, and $y$ are all {\sl positive}! Exponent underflow is usually
not anticipated by a programmer, so he needs to be told about
it.
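The anomaly just described is easy to reproduce on present-day machines; in
IEEE-style double precision, for example, a product smaller than roughly
$10↑{-323}$ is silently replaced by zero, so the same three factors can give
zero or not depending on the order of multiplication (a Python illustration):

u, v, w = 1e-200, 1e-200, 1e200

print((u * v) * w)     # 0.0 -- u*v underflows to zero before w is applied
print((u * w) * v)     # about 1e-200 -- nonzero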
3) {\sl Inserted garbage.} When scaling to the
left it is important to keep from introducing anything but zeros
at the right. For example, note the ``ENTX 0'' instruction in
line 21 of Program A, and the all-too-easily-forgotten ``ENTX
0'' instruction in line 04 of the FLOT subroutine (10). (But
it would be a mistake to clear register X after line 29 in the
division subroutine!)
4) {\sl Unforeseen rounding overflow.} When a number
like .999999997 is rounded to 8 digits, a carry will occur to
the left of the decimal point, and the result must be scaled
to the right. Many people have mistakenly concluded that rounding
overflow is impossible during multiplication, since they look
at the maximum value of $|f↓uf↓v|$ which is $1 - 2b↑{-p} + b↑{-2p}$,
and this cannot round up to 1. The fallacy in this reasoning
is exhibited in exercise 11. Curiously, the phenomenon of rounding
overflow {\sl is} impossible during floating-point division (see
exercise 12).
\quad ({\sl Note:} There is one school of thought which
says it is harmless to ``round'' .999999997 to .99999999 instead
of to 1.0000000, since this does not increase the worst-case
bounds on relative error. Furthermore the floating-point number
1.0000000 may be said to represent all real values in the interval
$[1.0000000 - 5 \times 10↑{-8}, 1.0000000 + 5 \times 10↑{-8}]$,
while .99999999 represents all values in the much smaller interval
$(.99999999 - 5 \times 10↑{-9}, .99999999 + 5 \times 10↑{-9})$.
Even though the latter interval does not contain the original
number .999999997, each number of the second interval is contained
in the first, so subsequent calculations with the second interval
are no less accurate than with the first. But this argument
is incompatible with the mathematical philosophy of floating-point
arithmetic which is expressed in Section 4.2.2.)
5) {\sl Rounding before normalizing.} Inaccuracies
are caused by premature rounding in the wrong digit position.
This error is obvious when rounding is being done to the left
of the appropriate position; and it is also dangerous in the
less obvious cases where rounding first is done too far to the
right, then is followed by rounding in the true position. For
this reason it is a mistake to round during the ``scaling-right''
operation in step A5, except as prescribed in exercise 5. (Yet
the special case of rounding in step N5, then rounding again
after rounding overflow has occurred, is harmless, because rounding
overflow always yields $\pm 1.0000000$ and this is unaffected by the
subsequent rounding process.)
6) {\sl Failure to retain enough precision in intermediate
calculations.} Detailed analyses of the accuracy of floating-point
arithmetic, made in the next section, suggest strongly that
normalizing floating-point routines should always deliver a
properly rounded result to the maximum possible accuracy. There
should be no exceptions to this dictum, even in cases which
occur with extremely low probability; the appropriate number
of significant digits should be retained throughout the computations,
as stated in Algorithms A and M.
\subsectionbegin{C. Floating-point hardware} Nearly
every large computer intended for scientific calculations includes
floating-point arithmetic as part of its repertoire of built-in
operations. Unfortunately, the design of such hardware usually
includes some anomalies which result in dismally poor behavior
in certain circumstances, and we hope that future computer designers
will pay more attention to providing the proper behavior than
they have in the past. It costs only a little more to build the
machine right, and considerations in the following section show
that substantial benefits will be gained.
The \MIX\ computer, which is being used as an example
of a ``typical'' machine in this series of books, has an optional
``floating-point attachment'' (available at extra cost) which
includes the following operations:
\yskip\quad FADD, FSUB, FMUL, FDIV, FLOT, FCMP (C = 1, 2, 3,
4, 5, 56, respectively; F = 6). The contents of rA after the
operation ``FADD V'' are precisely the same as the contents
of rA after the operations
$$\vcenter{\halign{\tt#\cr
STA\quad ACC\cr
LDA\quad V\cr
JMP\quad FADD\cr}}$$
where FADD is the subroutine which appears earlier
in this section, except that both operands are automatically
normalized before entry to the subroutine, if they are not already
in normalized form. (If exponent underflow occurs during this
pre-normalization, but not during the normalization of the answer,
no underflow is signalled.) Similar remarks apply to FSUB, FMUL,
and FDIV. The contents of rA after the operation ``FLOT'' are
the contents after ``JMP FLOT'' in the subroutine (10) above.
The contents of rA are unchanged by the operation
``FCMP V''; this instruction sets the comparison indicator to
less, equal, or greater, depending on whether the contents of
rA are ``definitely less than,'' ``approximately equal to,''
or ``definitely greater than'' V; this subject is discussed
in the next section, and the precise action is defined by the
subroutine FCMP of exercise 4.2.2--17 with EPSILON in location
0.
No register other than rA is affected by any of
the floating-point operations. If exponent overflow or underflow
occurs, the overflow toggle is turned on and the exponent of
the answer is given modulo the byte size. Division by zero leaves
undefined garbage in rA.
Sometimes it is helpful to use floating-point operators
in a nonstandard manner. For example, if the operation FLOT
had not been included as part of \MIX's floating-point attachment,
we could easily achieve its effect on 4-byte numbers by writing
$$\vcenter{\halign{\tt#\quad⊗\tt#\quad⊗\tt#\cr
FLOT⊗STJ⊗9F\cr
⊗SLA⊗1\cr
⊗ENTX⊗Q+4\cr
⊗SRC⊗1\cr
⊗FADD⊗=0=\cr
9H⊗JMP⊗*\cr}}\eqno(12)$$
\yyskip This routine is not strictly equivalent to
the FLOT operator, since it assumes that the 1:1 byte of rA
is zero, and it destroys rX. The handling of more general situations
is a little tricky, because rounding overflow can occur even
during a FLOT operation.
%folio 276 galley 5 (C) Addison-Wesley 1978 -
$$\quad Similarly, if we want to round a number $u$ from floating-point
form to the nearest fixed-point integer, and if we know that
number is nonnegative and will fit in at most three bytes, we
may write
$$\hjust{\tt FADD\quad FUDGE}$$
where FUDGE contains the constant
$$\hjust{\tt +\qquad Q+4\qquad 1\qquad 0\qquad 0\qquad 0}\;;$$
the result in rA will be
$$\hjust{\tt +\qquad Q+4\qquad 1\qquad round($u$)}.\eqno (13)$$
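The trick in (13) has a familiar analogue in binary double-precision
arithmetic: adding and then subtracting a large constant forces the fraction
to be rounded at the units position. In the Python sketch below, $2↑{52}$
plays the role of the FUDGE constant; the result is correct for $0 ≤ u < 2↑{51}$
under round-to-nearest (ties go to an even integer here, rather than as in
Program A).

def round_to_int(u):
    return (u + 2.0**52) - 2.0**52     # the addition rounds u at the units place

print(round_to_int(2.5), round_to_int(3.7))   # 2.0 4.0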
\quad Some machines have floating-point
hardware which suppresses automatic rounding and gives additional
information about the less-significant digits that would ordinarily
be rounded off. Such facilities are of use in extending the
precision of floating-point calculations, but their implementation
is usually subject to unfortunate anomalies, and unrounded results
are generally undesirable in single-precision calculations.
\subsectionbegin{D. History and bibliography} The origins
of floating-point notation can be traced back to the Babylonians
(c. 1800 B.C.), who made
use of radix 60 floating-point arithmetic but did not have a
notation for the exponents. The appropriate exponent was always
somehow ``understood'' by the man doing the calculations. At
least one case has been found in which the wrong answer was
given, because addition was performed with improper alignment
of the operands, but such examples are very rare; see O. Neugebauer,
{\sl The Exact Sciences in Antiquity} (Princeton University
Press, 1952), 26--27. Another early contribution to floating-point
notation is due to the Greek mathematician Apollonius (3rd century
B.C.), who apparently was
the first to explain how to simplify multiplication by collecting
powers of 10 separately from their coefficients, at least in
simple cases. [For a discussion of Apollonius's method, see
Pappus, {\sl Mathematical Collections} (4th century
A.D.).] After the Babylonian civilization
died out, the first significant uses of floating-point notation
for products and quotients did not emerge until much later,
about the time logarithms were invented (1600) and shortly afterwards
when Oughtred invented the slide rule (1630). The modern notation
``$x↑n$'' for exponents was being introduced at about the same
time; separate symbols for $x$ squared, $x$ cubed, etc.\ had
been in use before this.
Floating-point arithmetic was incorporated into
the design of some of the earliest computers. It was independently
proposed by Leonardo Torres y Quevedo in Madrid, 1914; by Konrad
Zuse in Berlin, 1936; and by George Stibitz in New Jersey, 1939.
Zuse's machines used a floating-binary representation which
he called ``semi-logarithmic notation''; he also incorporated
conventions for dealing with the special quantities ``∞'' and
``undefined.'' The first American computers to operate with
floating-point arithmetic hardware were the Bell Laboratories'
Model V and the Harvard Mark II, both of which were relay calculators
designed in 1944. [See B. Randell, {\sl The Origins of Digital
Computers} (Berlin: Springer, 1973), pp.\ 100, 155, 163--164,
259--260; {\sl Proc.\ Symp.\ on Large-Scale Digital Calculating
Machinery} (Harvard, 1947), 41--68, 69--79; {\sl Datamation
\bf 13} (April 1967), 35--44, (May 1967), 45--49; {\sl Zeitschrift
f\"ur angewandte Math.\ und Physik \bf 1} (1950),
345--346.]
The use of floating-binary arithmetic
was seriously considered in 1944--1946 by researchers at the
Moore School in their plans for the first {\sl electronic} digital
computers, but it turned out to be much harder to implement
floating-point circuitry with tubes than with relays. The group
realized that scaling is a problem in programming, but felt
that it is only a very small part of a total programming job
which is usually worth the time and trouble it takes, since
it tends to keep a programmer aware of the numerical accuracy
he is getting. Furthermore, they argued that floating-point
representation takes up valuable memory space, since the exponents
must be stored, and that it becomes difficult to adapt floating-point
arithmetic to multiple-precision calculations. [See von Neumann's
{\sl Collected Works} {\bf 5} (New York: Macmillan, 1963), 43,
73--74.] At this time, of course, they were designing the first
stored-program computer and the second electronic computer,
and their choice had to be either fixed-point {\sl or} floating-point
arithmetic, not both. They anticipated the coding of floating-binary
routines, and in fact ``shift-left'' and ``shift-right'' instructions
were put into their machine primarily to make such routines
more efficient. The first machine to have both kinds of arithmetic
in its hardware was apparently a computer developed at General
Electric Company [see
{\sl Proc.\ 2nd Symp.\ on Large-Scale
Digital Calculating Machinery} (Cambridge: Harvard University
Press, 1951), 65--69].
Floating-point subroutines and interpretive systems
for early machines were coded by D. J. Wheeler and others, and
the first publication of such routines was in {\sl The Preparation
of Programs for an Electronic Digital Computer} by Wilkes, Wheeler,
and Gill (Reading, Mass.: Addison-Wesley, 1951), subroutines
A1--A11, pp.\ 35--37, 105--117. It is interesting to note that
floating-{\sl decimal} subroutines are described here, although
a binary computer was being used; in other words, the numbers
were represented as 10$↑ef$, not $2↑ef$, and therefore the scaling
operations required multiplication or division by 10. On this
particular machine such decimal scaling was about as easy as
shifting, and the decimal approach greatly simplified input-output
conversions.
Most published references to the details of floating-point
arithmetic routines are scattered in ``technical memorandums''
distributed by various computer manufacturers, but there have
been occasional appearances of these routines in the open literature.
Besides the reference above, see R. H. Stark and D. B. MacMillan,
{\sl Math.\ Comp.\ \bf 5} (1951), 86--92, where a plugboard-wired
program is described; D. McCracken, {\sl Digital Computer Programming}
(New York: Wiley, 1957), 121--131; J. W. Carr III, {\sl CACM
\bf 2} (May 1959), 10--15; W. G. Wadey, {\sl JACM \bf 7} (1960),
129--139; D. E. Knuth, {\sl JACM \bf 8} (1961), 119--128; O.
Kesner, {\sl CACM \bf 5} (1962), 269--271; F. P. Brooks and
K. E. Iverson, {\sl Automatic Data Processing} (New York: Wiley,
1963), 184--199. For a discussion of floating-point arithmetic
from a computer designer's standpoint, see ``Floating-point
operation'' by S. G. Campbell, in {\sl Planning a Computer System},
ed.\ by W. Buchholz (New York: McGraw-Hill, 1962), 92--121. Additional
references are given in Section 4.2.2.
|qleft \24skip {\:r EXERCISES
\exno 1. [10] How would
Avogadro's number and Planck's constant be represented in base
100, excess 50, four-digit floating-point notation? (This would
be the representation used by \MIX\, as in (4).)
%folio 282 galley 6 (C) Addison-Wesley 1978 -
\exno 2. [12] Assume that the exponent
$e$ is constrained to lie in the range $0 ≤ e ≤ E$; what are
the largest and smallest positive values which can be written
as base $b$, excess $q$, $p$-digit floating-point numbers? What
are the largest and smallest positive values which can be written
as {\sl normalized} floating-point numbers with these specifications?
\exno 3. [11] Show that if we are using normalized floating-binary
arithmetic, there is a way to increase the precision slightly
without loss of memory space: A $p$-bit fraction part can be
represented using only $p - 1$ bit positions of a computer word,
if the range of exponent values is decreased very slightly.
\exno 4. [10] Assume that $b = 10$, $p = 8$. What result does
Algorithm A give for $(50, +.98765432) \oplus (49, +.33333333)$?
For $(53, -.99987654) \oplus (54, +.10000000)$? For $(45, -.50000001)
\oplus (54, +.10000000)$?
\exno 5. [24] Let us say that $x ~ y$ (with respect to a given
radix $b)$ if $x$ and $y$ are real numbers satisfying the following
conditions:
$$\vcenter{\halign{\lft{$#$}\cr
\lfloor x/b\rfloor = \lfloor y/b\rfloor ;\cr
\noalign{\vskip 4pt}
x\mod b = 0\qquad \hjust{iff}\qquad y\mod b = 0;\cr
\noalign{\vskip 4pt}
0 < x\mod b < {1\over 2}b\qquad \hjust{iff}\qquad 0 < y\mod b < {1\over 2}b;\cr
\noalign{\vskip 4pt}
x\mod b = {1\over 2}b\qquad \hjust{iff}\qquad y\mod b = {1\over 2}b;\cr
\noalign{\vskip 4pt}
{1\over 2}b < x\mod b < b\qquad \hjust{iff}\qquad {1\over 2}b < y\mod b < b.\cr}}$$
Prove that if $f↓v$ is replaced by $b↑{-p-2}F↓v$
between steps A5 and A6 of Algorithm A, where $F↓v ~ b↑{p+2}f↓v$,
the result of that algorithm will be unchanged. (If $F↓v$ is
an integer and $b$ is even, this operation essentially truncates
$f↓v$ to $p + 2$ places while remembering whether any nonzero
digits have been dropped, thereby minimizing the length of register
which is needed for the addition in step A6.)
\exno 6. [20] If the result of a FADD instruction is zero, what
will be the sign of rA, according to the definitions in this
section?
\exno 7. [27] Discuss floating-point arithmetic
using balanced ternary notation.
\exno 8. [20] Give examples of normalized eight-digit floating-decimal
numbers $u$ and $v$ for which addition yields\xskip (a) exponent underflow,\xskip
(b) exponent overflow, assuming that exponents must satisfy
$0 ≤ e < 100$.
\exno 9. [M24] (W. M. Kahan.)\xskip Assume
that the occurrence of exponent underflow causes the result
to be replaced by zero, with no error indication given. Using
excess zero, eight-digit floating-decimal numbers with $e$ in
the range $-50 ≤ e < 50$, find positive values of $a$, $b$,
$c$, $d$, and $y$ such that (11) holds.
\exno 10. [12] Give an example of normalized eight-digit floating-decimal
numbers $u$ and $v$ for which rounding overflow occurs in addition.
\exno 11. [M20] Give an example of normalized excess 50, eight-digit
floating-decimal numbers $u$ and $v$ for which rounding overflow
occurs in multiplication.
\exno 12. [M25] Prove that rounding overflow cannot occur during
the normalization phase of floating-point division.
\exno 13. [30] When doing ``interval arithmetic'' we don't want
to round the results of a floating-point computation; we want
rather to implement operations such as $∇$ and $∆$ which
give the tightest possible representable bounds on the true
sum: $u ∇ v ≤ u + v ≤ u ∆ v$. How should the algorithms
of this section be modified for such a purpose?
\exno 18. [25] Consider a binary computer with 36-bit words,
on which positive floating-binary numbers are represented as
(0$e↓1e↓2 \ldotsm e↓8f↓1f↓2 \ldotsm f↓{27})↓2$; here $(e↓1e↓2
\ldotsm e↓8)↓2$ is an excess (10000000)$↓2 exponent and (f↓1f↓2
\ldotsm f↓{27})↓2$ is a 27-bit fraction. Negative floating-point
numbers are represented by the {\sl two's complement} of the corresponding
positive representation (see Section 4.1). Thus, 1.5 is $({\sl
201600000000})↓8$ while $-1.5$ is $({\sl 576200000000})↓8$; the
octal representations of 1.0 and $-1.0$ are ${\sl 201400000000}$ and
${\sl 576400000000}$, respectively. Note that bit $f↓1$ of a
normalized positive number is always 1, while it is almost always
zero for negative numbers; the exceptional cases are representations
of $-2↑k$.
|qleft \qquad Suppose that the exact result of a floating-point
operation has the octal code ${\sl 572|740000000}|{\sl 01}$;
this (negative) 33-bit fraction must be normalized and rounded
to 27 bits. If we shift left until the leading fraction bit
is zero, we get ${\sl 576}|{\sl 000000000}|{\sl 20}$, but this
rounds to the illegal value ${\sl 576}|{\sl 000000000}$; we
have over-normalized, since the correct answer is ${\sl 575}|{\sl
400000000}$. On the other hand if we start with the value ${\sl
572}|{\sl 740000000}|{\sl 05}$ and stop before over-normalizing
it, we obtain ${\sl 575}|{\sl 400000000}|{\sl 50}$ which rounds
to the unnormalized number ${\sl 575}|{\sl 400000001}$; subsequent
normalization yields ${\sl 576}|{\sl 000000002}$ while the correct
answer is ${\sl 576}|{\sl 000000001}$.
Give a simple, correct rounding rule which resolves
this dilemma on such a machine (without abandoning two's complement
notation).
\exno 19. [24] What is the running time for the FADD subroutine
in Program A, in terms of relevant characteristics of the data?
What is the maximum running time, over all inputs that do not
cause overflow or underflow?
\exno 14. [25] Write a \MIX\ subroutine which begins with an arbitrary
floating-point number in register A, not necessarily normalized,
and which converts it to the nearest fixed-point integer (or
determines that it is too large in absolute value to make such
a conversion possible).
\exno 15. [28] Write a \MIX\ subroutine, to be used in connection
with the other subroutines of this section, that calculates
$u$ mod 1, that is, $u - \lfloor u\rfloor$ rounded to the nearest
floating-point number, given a floating-point number $u$. Note
that when $u$ is a very small negative number, $u$ mod 1 will
be rounded so that the result is unity (even though $u$ mod
1 has been defined to be always {\sl less} than unity,
as a real number).
\exno 16. [HM21] (Robert L. Smith.)\xskip Design an algorithm to compute
the real and imaginary parts of the complex number $(a + bi)/(c
+ di)$, given real floating-point values $a, b, c$, and $d$.
Avoid the computation of $c↑2 + d↑2$, since it would cause floating-point
overflow even when $|c|$ or $|d|$ is approximately the
square root of the maximum allowable floating-point value.
\exno 17. [40] (John Cocke.)\xskip Explore the idea of extending the
range of floating-point numbers by defining a single-word representation
in which the precision of the fraction decreases as the magnitude
of the exponent increases.
|qleft
%folio 285 galley 7 WARNING: Much tape unreadable! (C) Addison-Wesley 1978 -
{\:r 4.2.2. Accuracy of Floating-Point Arithmetic}

Floating-point computation is by nature
inexact, and it is not difficult to misuse it so that the computed
answers consist almost entirely of ``noise.'' One of the principal
problems of numerical analysis is to determine how accurate
the results of certain numerical methods will be; a ``credibility-gap''
problem is involved here: we don't know how much of the computer's
answers to believe. Novice computer users solve this problem
by implicitly trusting in the computer as an infallible authority;
they tend to believe that all digits of a printed answer are
significant. Disillusioned computer users have just the opposite
approach: they are constantly afraid their answers are almost
meaningless. Many a serious mathematician has attempted to give
rigorous analyses of a sequence of floating-point operations,
but has found the task to be so formidable that he has tried to content
himself with plausibility arguments instead.
A thorough examination of error analysis techniques
is, of course, beyond the scope of this book, but we will in
this section study some of the characteristics of floating-point
arithmetic errors. Our goal is to discover how to perform floating-point
arithmetic in such a way that reasonable analyses of error propagation
are facilitated as much as possible.
A rough (but reasonably useful) way to express
the behavior of floating-point arithmetic can be based on the
concept of ``significant figures'' or {\sl relative error.}
If we are representing an exact real number $x$ inside a computer
by using the approximation $|spose 7x = x(1 + ε)$, the quantity
$ε = (|spose 7x - x)/x$ is called the relative error of approximation.
Roughly speaking, the operations of floating-point multiplication
and division do not magnify the relative error by very much;
but floating-point subtraction of nearly equal quantities (and
floating-point addition, $u \oplus v$, where $u$ is nearly equal
to $-v$) can very greatly increase the relative error. So we
have a general rule of thumb, that a substantial loss of accuracy
is expected from such additions and subtractions, but not from
multiplications and divisions. On the other hand, the situation
is somewhat paradoxical and needs to be understood properly,
since ``bad'' additions and subtractions are performed with
perfect accuracy! (See exercise 25.)
One of the consequences of the possible unreliability
of floating-point addition is that the associative law breaks
down:
$$(u \oplus v) \oplus w ≠ u \oplus (v \oplus w),\qquad \hjust{for
certain}\quad u, v, w.\eqno (1)$$
For example,
$$\vcenter{\halign{\lft{$#$}⊗\lft{$\;#$}\cr
(11111113. \oplus -11111111.) \oplus 7.5111111⊗= 2.0000000 \oplus 7.5111111\cr
⊗= 9.5111111;\cr
\noalign{\vskip 4pt}
11111113. \oplus (-11111111. \oplus 7.5111111)⊗= 11111113. \oplus -11111103.\cr
⊗= 10.000000.\cr}}$$
(\ldots\ the exact operations $+$, $-$, $\times$, $/$.)
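The failure of the associative law shown above can be replayed exactly with
Python's decimal module set to eight significant digits (matching the
eight-digit floating-decimal arithmetic of the example); this is only an
illustrative stand-in for the operations of this section:

from decimal import Decimal, getcontext

getcontext().prec = 8
u, v, w = Decimal("11111113"), Decimal("-11111111"), Decimal("7.5111111")

print((u + v) + w)     # 9.5111111
print(u + (v + w))     # 10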
In view of the failure of the associative law,
the comment of Mrs.\ La Touche which appears at the beginning
of this chapter [taken from {\sl Math.\ Gazette \bf 12} (1924),
95] makes a good deal of sense with respect to floating-point
arithmetic. Mathematical notations like ``$a↓1 + a↓2 + a↓3$''
or ``$\sum ↓{1≤k≤n} a↓k$'' are inherently based upon the assumption
of associativity, so a programmer must be especially careful
that he does not implicitly assume the validity of the associative
law.
\subsectionbegin{A. An axiomatic approach} Although
the associative law is not valid, the commutative law
$$u \oplus v = v \oplus u\eqno (2)$$
does hold, and this law can be a valuable conceptual
asset in programming and in the analysis of programs. This example
suggests that we should look for important laws which {\sl are}
satisfied by $\oplus$, $\ominus$, $\otimes$, and $\odiv$; it is
not unreasonable to say that {\sl floating-point routines should
be designed to preserve as many of the ordinary mathematical
laws as possible.}
Let us now consider some of the other basic laws
which are valid for normalized floating-point operations as
described in the previous section. First we have
$$u \ominus v = u \oplus -v;\eqno (3)$$
$$-(u \oplus v) = -u \oplus -v;\eqno (4)$$
$$u \oplus v = \hjust{round}(u + v),\qquad u \ominus v = \hjust{round}(u - v),$$
$$u \otimes v = \hjust{round}(u \times v),\qquad u \odiv v = \hjust{round}(u/v),$$
where round$(x)$ denotes the best floating-point
approximation to $x$. We have
$$\hjust{round}(-x) = -\hjust{round}(x),\eqno (10)$$
$$x ≤ y\qquad \hjust{implies}\qquad \hjust{round}(x) ≤ \hjust{round}(y),\eqno (11)$$
and these fundamental relations lead to immediate
proofs of properties (2) through (8). We can also write down
several more identities:
$$\vcenter{\halign{\lft{$#$}\cr
u \otimes v = v \otimes u,\qquad (-u) \otimes v = -(u \otimes v),\qquad 1 \otimes v = v;\cr
\noalign{\vskip 4pt}
u \otimes v = 0\qquad \hjust{if and only if}\qquad u = 0\quad \hjust{or}\quad v = 0;\cr
\noalign{\vskip 4pt}
(-u) \odiv v = u \odiv (-v) = -(u \odiv v),\qquad 0 \odiv v = 0,\cr
\noalign{\vskip 4pt}
u \odiv 1 = u,\qquad u \odiv u = 1;\cr}}$$
if $u ≤ v$ and $w > 0$ then $u \otimes w ≤ v \otimes
w$, $u \odiv w ≤ v \odiv w$, and $w \odiv u ≥ w \odiv
v$. If $u \oplus v = u + v$ then $(u \oplus v) \ominus v = u$,
and if $u \otimes v = u \times v ≠ 0$ then $(u \otimes v) \odiv
v = u$, etc. We see that a good deal of regularity is present
in spite of the inexactness of the floating-point operations.
%folio 291 galley 8 (C) Addison-Wesley 1978
$$\quad Several familiar rules of algebra are still, of course,
conspicuously absent from the collection of identities above;
the associative law for floating-point multiplication is not
strictly true, as shown in exercise 3, and the distributive
law between \otimes and \oplus can fail rather badly: Let $u
= 20000.000$, $v = -6.0000000$, and $w = 6.0000003$; then
$$\vcenter{\halign{\lft{$#$}⊗\lft{$\;#$}\cr
(u \otimes v) \oplus (u \otimes w)⊗= -120000.00 \oplus 120000.01 = .010000000\cr
\noalign{\vskip 3pt}
u \otimes (v \oplus w)⊗= 20000.000 \otimes .00000030000000 = .0060000000\cr}}$$
so
$$u \otimes (v \oplus w) ≠ (u \otimes v) \oplus (u \otimes w).\eqno (12)$$
On the other hand we do have $b \otimes (v \oplus
w) = (b \otimes v) \oplus (b \otimes w)$, since
$$\hjust{round}(bx) = b\,\hjust{round}(x).\eqno (13)$$
(Strictly speaking, the identities and inequalities
we are considering in this section implicitly assume that no
exponent underflow or overflow occurs. The function round($x)$
is undefined when $|x|$ is too small or too large, and equations
such as (13) hold only when both sides are defined.)
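In the same way, the failure (12) of the distributive law can be checked
directly in eight-digit decimal arithmetic (again using Python's decimal
module purely as a stand-in for $\oplus$ and $\otimes$):

from decimal import Decimal, getcontext

getcontext().prec = 8
u, v, w = Decimal("20000.000"), Decimal("-6.0000000"), Decimal("6.0000003")

print(u * v + u * w)   # 0.01
print(u * (v + w))     # 0.0060000000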
The failure of Cauchy's fundamental inequality
$(x|spose ↓1↑2 + \cdots + x|spose ↓n↑2)(y|spose ↓1↑2 + \cdots
+ y|spose ↓n↑2) ≥ (x↓1y↓1 + \cdots + x↓ny↓n)↑2$ is another important
example where traditional algebra breaks down in the presence
of floating-point arithmetic; exercise 7 exhibits floating-binary
numbers $u$ and $v$ such that $2(u↑2 \oplus
v↑2) < (u \oplus v)↑2$. Novice programmers
who calculate the standard deviation of some observations by
using the textbook formula
$$\sigma = \sqrt{\Bigl(n \sum ↓{1≤k≤n}x|spose ↓k↑2 - \bigl(\sum
↓{1≤k≤n}x↓k\bigr)↑2\Bigr)\Big/ n(n - 1)}\eqno (14)$$
often find themselves taking the square
root of a negative number. A much better way to calculate means
and standard deviations with floating-point arithmetic is to
use the recurrence formulas
$$\vcenter{\halign{\lft{$#$}\qquad⊗\lft{$#$}\qquad⊗\lft{#}\cr
M↓1 = x↓1,⊗M↓k = M↓{k-1} \oplus (x↓k \ominus M↓{k-1}) \odiv k,⊗(15)\cr
\noalign{\vskip 3pt}
S↓1 = 0,⊗S↓k = S↓{k-1} \oplus (x↓k \ominus M↓{k-1}) \otimes (x↓k \ominus M↓k),⊗(16)\cr}}$$
for $2 ≤ k ≤ n$, where $\sigma = \sqrt{S↓n/(n - 1)}$.
[Cf.\ B. P. Welford, {\sl Technometrics \bf 4} (1962), 419--420.]
With this method $S↓n$ can never be negative, and we avoid other
serious problems encountered by the naive method of accumulating
sums, as shown in exercise 16. (See exercise 19 for a summation
technique that provides an even better guarantee.)
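A direct transcription of the recurrences (15) and (16) looks like this in
Python, with ordinary binary floating-point standing in for the operations
$\oplus$, $\ominus$, $\otimes$, $\odiv$ (an illustrative sketch only):

def mean_and_stddev(xs):
    """Running mean M_k and sum S_k by (15) and (16); xs must be nonempty."""
    m, s = xs[0], 0.0                  # M_1 = x_1,  S_1 = 0
    for k, x in enumerate(xs[1:], start=2):
        d = x - m                      # x_k (-) M_{k-1}
        m = m + d / k                  # (15)
        s = s + d * (x - m)            # (16), using the new M_k
    n = len(xs)
    sigma = (s / (n - 1)) ** 0.5 if n > 1 else 0.0
    return m, sigma

print(mean_and_stddev([10000000.1, 10000000.2, 10000000.3]))
# approximately (10000000.2, 0.1); the textbook formula (14) does much worse here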
\quad Although algebraic laws do not always hold exactly, we can often show that they aren't
too far off base. When $b↑{e-1} ≤ x < b↑e$ we have round$(x)
= x + \rho (x)$ where |$\rho (x)| ≤ {1\over 2}b↑{e-p}$; hence
$$\hjust{round}(x) = x\biglp 1 + \delta (x)\bigrp ,\eqno (17)$$
where the relative error is bounded independently
of $x$:
$$|\delta (x)| ≤ {1\over 2}b↑{1-p}.\eqno (18)$$
We can use this inequality to estimate the relative
error of normalized floating-point calculations in a simple
way, since $u \oplus v = (u + v)\biglp 1 + \delta (u + v)\bigrp$,
etc.
As an example of typical error-estimation procedures,
let us consider the associative law for multiplication. Exercise
3 shows that $(u \otimes v) \otimes w$ is not in general equal
to $u \otimes (v \otimes w)$; but the situation in this case
is much better than it was with respect to the associative law
of addition (1) and the distributive law (12). In fact, we have
$$\vcenter{\halign{\lft{$#$}⊗\lft{$\;#$}\cr
(u \otimes v) \otimes w⊗= \biglp (uv)(1 + \delta ↓1)\bigrp \otimes w = uvw(1 + \delta ↓1)(1 + \delta ↓2),\cr
\noalign{\vskip 3pt}
u \otimes (v \otimes w)⊗= u \otimes \biglp (vw)(1 + \delta ↓3)\bigrp = uvw(1 + \delta ↓3)(1 + \delta ↓4)\cr}}$$
for some $\delta ↓1$, $\delta ↓2$, $\delta ↓3$, $\delta ↓4$,
provided that no exponent underflow or overflow occurs, where
$|\delta ↓j| ≤ {1\over 2}b↑{1-p}$ for each $j$. Hence
$${(u \otimes v) \otimes w\over u \otimes (v \otimes w)} =
{(1 + \delta ↓1)(1 + \delta ↓2)\over (1 + \delta ↓3)(1 + \delta ↓4)} = 1 + \delta ,$$
where
$$|\delta | ≤ 2b↑{1-p}/(1 - {1\over 2}b↑{1-p})↑2.\eqno (19)$$
\quad The number $b↑{1-p}$ occurs so often in such
analyses, it has been given a special name, one {\sl ulp\/} (meaning
one ``unit in the last place'' of the fraction part). Floating-point
operations are correct to within half an {\sl ulp}, and the calculation
of $uvw$ by two floating-point multiplications will be correct
within about one {\sl ulp} (ignoring second-order terms); hence
the associative law for multiplication holds to within about
two {\sl ulps} of relative error.
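For IEEE-style binary double precision ($b = 2$, $p = 53$, so $ε↓0 = 2↑{-52}$)
the bound (19) is easy to test empirically; the little Python experiment below
simply compares the two orders of multiplication on random operands and checks
the observed relative difference against the bound (the operand range and
sample count are arbitrary choices made for this sketch):

import random

eps0 = 2.0 ** -52                      # one ulp of a number in [1, 2)
bound = 2 * eps0 / (1 - 0.5 * eps0) ** 2
worst = 0.0
random.seed(1)
for _ in range(100000):
    u = random.uniform(0.5, 2.0)
    v = random.uniform(0.5, 2.0)
    w = random.uniform(0.5, 2.0)
    x, y = (u * v) * w, u * (v * w)
    worst = max(worst, abs(x - y) / abs(y))

print(worst <= bound, worst, bound)    # True, observed value, bound from (19)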
We have shown that $(u \otimes v) \otimes w$ is
{\sl approximately equal to u \otimes (v \otimes w),} except
when exponent overflow or underflow is a problem. It is worth
while studying this intuitive idea of being ``approximately
equal'' in more detail; can we make such a statement more precise
in a reasonable way?
A programmer using floating-point arithmetic almost
never wants to test if $u = v$ (or at least he hardly ever should
try to do so), because this is an extremely improbable occurrence.
For example, if a recurrence relation
$$x↓{n+1} = f(x↓n)$$
is being used, where the theory in some textbook
says $x↓n$ approaches a limit as $n → ∞$, it is usually a mistake
to wait until $x↓{n+1} = x↓n$ for some $n$, since the sequence
$x↓n$ might be periodic with a longer period due to the rounding
of intermediate results. The proper procedure is to wait until
$|x↓{n+1} - x↓n| < \delta $, for some suitably chosen number
$\delta $; but since we don't necessarily know the order of
magnitude of $x↓n$ in advance, it is even better to wait until
$$|x↓{n+1} - x↓n| ≤ ε|x↓n|;\eqno (20)$$
now $ε$ is a number that is much easier to select.
This relation (20) is another way of saying that $x↓{n+1}$ and
$x↓n$ are approximately equal; and our discussion indicates
that a relation of ``approximately equal'' would be more useful
than the traditional relation of equality.
%folio 293 galley 9 (C) Addison-Wesley 1978
$$\quad In other words, the fact that strict equality of floating-point
values is of little importance implies that we ought to have
a new operation, {\sl floating-point comparison,} which is intended
to help assess the relative values of two floating-point quantities.
The following definitions seem to be appropriate, for base $b$,
excess $q$, floating-point numbers $u = (e↓u, f↓u)$ and $v =
(e↓v, f↓v)$:
$$u ≺ v\quad (ε)\qquad \hjust{if and only if}\qquad v - u > ε \max(b↑{e↓u-q}, b↑{e↓v-q});\eqno (21)$$
$$u ~ v\quad (ε)\qquad \hjust{if and only if}\qquad |v - u| ≤ ε \max(b↑{e↓u-q}, b↑{e↓v-q});\eqno (22)$$
$$u ≻ v\quad (ε)\qquad \hjust{if and only if}\qquad u - v > ε \max(b↑{e↓u-q}, b↑{e↓v-q});\eqno (23)$$
$$u \approx v\quad (ε)\qquad \hjust{if and only if}\qquad |v - u| ≤ ε \min(b↑{e↓u-q}, b↑{e↓v-q}).\eqno (24)$$
These definitions apply to unnormalized values
as well as to normalized ones. Note that exactly one of the three
possibilities $u ≺ v$ (definitely less than),
$u ~ v$ (approximately equal to), or $u ≻ v$ (definitely greater
than) must always hold for any given pair of values $u$ and
$v$. The relation $u \approx v$ is somewhat stronger than $u
~ v$, and it might be read ``$u$ is essentially equal to $v$.''
All of the relations are given in terms of a positive real number
$ε$ which measures the degree of approximation being considered.
One way to view the above definitions is to associate
a set $S(u) = \{x \mid |x - u| ≤ εb↑{e↓u-q}\}$ with each
floating-point number $u; S(u)$ represents a set of values near
$u$ based on the exponent of $u$'s floating-point representation.
In these terms, we have $u ≺ v$ if and only if $S(u) < v$ and
$u < S(v)$; $u ~ v$ if and only if $u \in S(v)$ or $v \in S(u)$;
$u ≻ v$ if and only if $u > S(v)$ and $S(u) > v$; $u \approx v$
if and only if $u \in S(v)$ and $v \in S(u)$. (Here we are assuming
that the parameter $ε$, which measures the degree of approximation,
is a constant; a more complete notation would indicate the dependence
of $S(u)$ upon $ε.)$
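A literal rendering of definitions (21)--(24) in the (e, f) model of Section
4.2.1 might look as follows (a Python sketch, with b = 10, q = 50 assumed; the
returned characters <, ~, >, = stand for the four relations, with = meaning
``essentially equal,'' which implies ~):

b, q = 10, 50

def fcmp(u, v, eps):
    (eu, fu), (ev, fv) = u, v
    uval, vval = fu * b**(eu - q), fv * b**(ev - q)
    big = eps * max(b**(eu - q), b**(ev - q))
    small = eps * min(b**(eu - q), b**(ev - q))
    if vval - uval > big:
        return '<'                     # (21): u is definitely less than v
    if uval - vval > big:
        return '>'                     # (23): u is definitely greater than v
    return '=' if abs(vval - uval) <= small else '~'   # (24), otherwise (22)

print(fcmp((51, 0.31415927), (51, 0.31415928), eps=1e-6))   # '='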
Here are some simple consequences of the above
definitions:
$$\hjust{if}\qquad u ≺ v\quad (ε)\qquad \hjust{then}\qquad v ≻ u\quad (ε);\eqno (25)$$
$$\hjust{if}\qquad u \approx v\quad (ε)\qquad \hjust{then}\qquad u ~ v\quad (ε);\eqno (26)$$
$$u \approx u\quad (ε);\eqno (27)$$
$$\hjust{if}\qquad u ≺ v\quad (ε)\qquad \hjust{then}\qquad u < v;\eqno (28)$$
$$\hjust{if}\qquad u ≺ v\quad (ε↓1)\quad \hjust{and}\quad ε↓1 ≥ ε↓2\qquad \hjust{then}\qquad u ≺ v\quad (ε↓2);\eqno (29)$$
$$\hjust{if}\qquad u ~ v\quad (ε↓1)\quad \hjust{and}\quad ε↓1 ≤ ε↓2\qquad \hjust{then}\qquad u ~ v\quad (ε↓2);\eqno (30)$$
$$\hjust{if}\qquad u \approx v\quad (ε↓1)\quad \hjust{and}\quad ε↓1 ≤ ε↓2\qquad \hjust{then}\qquad u \approx v\quad (ε↓2);\eqno (31)$$
$$\hjust{if}\qquad u ≺ v\quad (ε↓1)\quad \hjust{and}\quad v ≺ w\quad (ε↓2)\qquad \hjust{then}\qquad u ≺ w\quad (ε↓1 + ε↓2);\eqno (32)$$
$$\hjust{if}\qquad u \approx v\quad (ε↓1)\quad \hjust{and}\quad v \approx w\quad (ε↓2)\qquad \hjust{then}\qquad u ~ w\quad (ε↓1 + ε↓2).\eqno (33)$$
Moreover, we can prove without difficulty that
$$|u - v| ≤ ε|u|\quad \hjust{and}\quad |u - v| ≤ ε|v|\qquad \hjust{implies}\qquad u \approx v\quad (ε);\eqno (34)$$
$$|u - v| ≤ ε|u|\quad \hjust{or}\quad |u - v| ≤ ε|v|\qquad \hjust{implies}\qquad u ~ v\quad (ε);\eqno (35)$$
and conversely, for {\sl normalized} floating-point
numbers $u$ and $v$, when $ε < 1$,
$$u \approx v\quad (ε)\qquad \hjust{implies}\qquad |u - v| ≤ bε|u|\quad \hjust{and}\quad |u - v| ≤ bε|v|;\eqno (36)$$
$$u ~ v\quad (ε)\qquad \hjust{implies}\qquad |u - v| ≤ bε|u|\quad \hjust{or}\quad |u - v| ≤ bε|v|.\eqno (37)$$
\quad Let $ε↓0 = b↑{1-p}$, one {\sl ulp}. From the
derivation of (17) we have $|x - \hjust{round}(x)| = |\rho (x)| ≤
{1\over 2}ε↓0 \min(|x|, |\hjust{round}(x)|)$, hence
$$x \approx \hjust{round}(x)\quad ({1\over 2}ε↓0);\eqno (38)$$
and $u \oplus v \approx u + v\quad ({1\over 2}ε↓0)$,
etc. The approximate associative law for multiplication derived
above can be recast as follows: We have
$$|(u \otimes v) \otimes w - u \otimes (v \otimes w)| ≤ {2ε↓0\over
(1 - {1\over 2}ε↓0)↑2} |u \otimes (v \otimes w)|$$
by (19), and the same inequality is valid with
($u \otimes v) \otimes w$ and $u \otimes (v \otimes w)$ interchanged.
Hence by (34),
$$(u \otimes v) \otimes w \approx u \otimes (v \otimes w)\quad (ε)\eqno (39)$$
whenever $ε ≥ 2ε↓0/(1 - {1\over 2}ε↓0)↑2$. For
example, if $b = 10$ and $p = 8$ we may take $ε = 0.00000021$.
%folio 301 galley 10 (C) Addison-Wesley 1978
$$\quad The relations $≺$, ~, $≻$, and $\approx$ are useful within
numerical algorithms, and it is therefore a good idea to provide
routines for comparing floating-point numbers as well as for
doing arithmetic on them.
Let us now shift our attention back to the question
of finding {\sl exact} relations which are satisfied by the
floating-point operations. It is interesting to note that floating-point
addition and subtraction are not completely intractable from
an axiomatic standpoint, since they do satisfy the nontrivial
identities stated in the following theorems:
\thbegin Theorem A. {\sl Let $u$ and $v$ be normalized floating-point
numbers. Then}
$$\biglp (u \oplus v) \ominus u\bigrp + \biglp (u \oplus v)
\ominus \biglp (u \oplus v) \ominus u\bigrp \bigrp = u \oplus v,\eqno (40)$$
provided that no exponent overflow or underflow occurs.
This rather cumbersome-looking identity can be rewritten in a simpler manner: Let
$$u↑\prime = (u \oplus v) \ominus v,\qquad v↑\prime = (u \oplus v) \ominus u;$$
$$u\dprime = (u \oplus v) \ominus v↑\prime ,\qquad v\dprime = (u \oplus v) \ominus u↑\prime .\eqno (41)$$
Intuitively, $u↑\prime$ and $u\dprime$ should be approximations
to $u$, and $v↑\prime$ and $v\dprime$ should be approximations
to $v$. Theorem A tells us that
$$u \oplus v = u↑\prime + v\dprime = u\dprime + v↑\prime .\eqno (42)$$
This is a stronger statement than the identity
$$u \oplus v = u↑\prime \oplus v\dprime = u\dprime \oplus v↑\prime ,\eqno (43)$$
which follows by rounding (42).
{\sl Proof.} \xskip Let us say that $t$ is a
{\sl tail} of $x$ modulo $b↑e$ if
$$t ≡ x\quad ($modulo $b↑e),\qquad |t| ≤ {1\over 2}b↑e;\eqno (44)$$
thus, $x - {}$round$(x)$ is always a tail of $x$.
The proof of Theorem A rests largely on the following simple
fact proved in exercise 11:

\thbegin Lemma T. {\sl If $t$ is a tail of the floating-point number
$x$, then $x \ominus t = x - t$.}
|qleft \12skip \quad Let $w = u \oplus v$. Theorem A holds trivially
when $w = 0$. By multiplying all variables by a suitable power
of $b$, we may assume without loss of generality that $e↓w =
p$. Then $u + v = w + r$, where $r$ is a tail of $u + v$ modulo
1. Furthermore $u↑\prime =$ round($w - v) =$ round($u - r) =
u - r - t$, where $t$ is a tail of $u - r$ modulo $b↑e$ and
$e = e↓u - p$.
If $e ≤ 0$ then $t ≡ u - r ≡ -v$ (modulo
$b↑e)$, hence $t$ is a tail of $-v$ and $v\dprime =$ round($w
- u↑\prime ) =$ round($v + t) = v + t$; this proves (40). If
$e > 0$ then $|u - r| ≥ b↑p - {1\over 2}$, and since $|r| ≤
{1\over 2}$ we have $|u| ≥ b↑p - 1$. It follows that $r$ is
a tail of $v$ modulo 1. If $u↑\prime = u$, we have $v\dprime
=$ round($w - u) =$ round($v - r) = v - r$. Otherwise the relation
round($u - r) ≠ u$ implies that $|u| = b↑p - 1$, $|r| = {1\over
2}$, $|u↑\prime | = b↑p$; we have $v\dprime =$ round$(w - u↑\prime
) =$ round($v + r) = v + r$, because $r$ is also a tail of $-v$
in this case.
\quad Theorem A exhibits a regularity property of floating-point
addition, but it doesn't seem to be an especially useful result.
The following identity is more significant:

\thbegin Theorem B. {\sl Under the hypotheses of Theorem A and (41),}
$$u + v = (u \oplus v) + \biglp (u \ominus u↑\prime ) \oplus
(v \ominus v\dprime )\bigrp .\eqno (45)$$
{\sl Proof.} \xskip In fact, we can show that $u \ominus
u↑\prime = u - u↑\prime$, $v \ominus v\dprime = v - v\dprime$,
and $(u - u↑\prime ) \oplus (v - v\dprime ) = (u - u↑\prime
) + (v - v\dprime )$, hence (45) will follow from Theorem A.
Using the notation of the preceding proof, these relations are
respectively equivalent to
$$\hbox{round}(t + r) = t + r,\qquad \hbox{round}(t) = t,\qquad \hbox{round}(r) = r.\eqno (46)$$
Exercise 12 establishes the theorem in the special
case $|e↓u - e↓v| ≥ p$. Otherwise $u + v$ has at most $2p$ significant
digits and it is easy to see that round($r) = r$. If now $e
> 0$, the proof of Theorem A shows that $t = -r$ or $t = r =
\pm {1\over 2}$. If $e ≤ 0$ we have $t + r ≡ u$ and $t ≡
-v$ (modulo $b↑e)$; this is enough to prove that $t + r$ and
$r$ round to themselves, provided that $e↓u ≥ e$ and $e↓v ≥
e$. But either $e↓u < 0$ or $e↓v < 0$ would contradict our hypothesis
that $|e↓u - e↓v| < p$, since $e↓w = p$.
Theorem B gives {\sl an explicit formula for the
difference} between $u + v$ and $u \oplus v$, in terms of quantities
that can be calculated directly using five operations of floating-point
arithmetic. If the radix $b$ is 2 or 3, we can improve on this
result, obtaining the exact value of the correction term with
only two floating-point operations and one (fixed-point) comparison
of absolute values:
\thbegin Theorem C. {\sl If $b ≤ 3$ and $|u| ≥ |v|$ then}
$$u + v = (u \oplus v) + \biglp\biglp u \ominus (u \oplus v)\bigrp
\oplus v\bigrp.\eqno (47)$$

{\sl Proof.} \xskip Following the conventions of preceding
proofs again, we wish to show that $v \ominus v↑\prime = r$.
It suffices to show that $v↑\prime = w - u$, because (46) will
then yield $v \ominus v↑\prime =$ round($v - v↑\prime ) =$ round($u
+ v - w) =$ round$(r) = r$.
We shall in fact prove (47) whenever $b ≤ 3$ and
$e↓u ≥ e↓v$. If $e↓u ≥ p$ then $r$ is a tail of $v$ modulo 1,
hence $v↑\prime = w \ominus u = v \ominus r = v - r = w - u$
as desired. If $e↓u < p$ then we must have $e↓u = p - 1$, and
$w - u$ is a multiple of $b↑{-1}$; it will therefore round to
itself if its magnitude is less than $b↑{p-1} + b↑{-1}$. Since
$b ≤ 3$, we have indeed $|w - u| ≤ |w - u - v| + |v| ≤ {1\over
2} + (b↑{p-1} - b↑{-1}) < b↑{p-1} + b↑{-1}$.
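When the radix is 2 and addition is correctly rounded, Theorems B and C can be tried out directly in any language that provides IEEE-style double precision. The following Python sketch (an illustration, not the \MIX\ routines of Section 4.2.1) recovers the exact rounding error of a floating-point addition in both ways:

# Theorem B: u + v = (u (+) v) + ((u (-) u') (+) (v (-) v'')), computed
# here in ordinary double precision, where b = 2 and rounding is correct.
def error_term_B(u, v):
    w = u + v                # u (+) v
    up = w - v               # u'  = (u (+) v) (-) v
    vpp = w - up             # v'' = (u (+) v) (-) u'
    return w, (u - up) + (v - vpp)

# Theorem C: two operations and a comparison suffice when |u| >= |v|.
def error_term_C(u, v):
    assert abs(u) >= abs(v)
    w = u + v
    return w, (u - w) + v

u, v = 1.0, 1e-17
print(error_term_B(u, v))    # (1.0, 1e-17): w + e equals u + v exactly
print(error_term_C(u, v))    # the same pair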
|qleft \12skip \quad The proofs of Theorems A, B, and C do not
rely on the precise definitions of round($x)$ in the ambiguous
cases when $x$ is exactly midway between consecutive floating-point
numbers; any way of resolving the ambiguity will suffice for
the validity of everything we have proved so far. No rounding
rule can be best for every application; for example, we generally
want a special rule when computing our income tax. But for most
numerical calculations the best policy appears to be the rounding
scheme specified in Algorithm 4.2.1 N, which insists that the
least significant digit should always be made even (or always
odd) when an ambiguous value is rounded. This is not a trivial
technicality, of interest only to nit-pickers; it is an important
practical consideration, since the ambiguous case arises surprisingly
often and a biased rounding rule can produce noticeably poor
results. For example, consider decimal arithmetic and assume
that remainders of 5 are always rounded upwards. Then if $u
= 1.0000000$ and $v = 0.55555555$ we have $u \oplus v = 1.5555556$;
and if we floating-subtract $v$ from this result we get $u↑\prime
= 1.0000001$. Adding and subtracting $v$ from $u↑\prime$ gives
1.0000002, and the next time we will get 1.0000003, etc.; the
result keeps growing although we are adding and subtracting
the same value.
This phenomenon, called {\sl drift,} will not occur
when we use a stable rounding rule based on the parity of the
least significant digit. More precisely:
\thbegin Theorem D. $\biglp ((u \oplus v) \ominus v) \oplus
v\bigrp \ominus v = (u \oplus v) \ominus v$.
|qleft \12skip For example, if $u = 1.2345679$ and $v = -0.23456785$,
we find $u \oplus v = 1.0000000, (u \oplus v) \ominus v = 1.2345678,
\biglp (u \oplus v) \ominus v\bigrp \oplus v = 0.99999995, \biglp
((u \oplus v) \ominus v) \oplus v\bigrp \ominus v = 1.2345678$.
The proof for general $u$ and $v$ seems to require a case analysis
even more detailed than that in the above theorems; see the
references at the end of this section.
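The drift phenomenon, and the stability promised by Theorem D, can be watched directly by simulating eight-digit floating-decimal arithmetic. The following Python sketch uses the decimal module, with ROUND_HALF_UP standing for the biased rule and ROUND_HALF_EVEN for the stable one; it reproduces the numbers quoted above:

# Eight significant digits; two rounding rules for the ambiguous case.
from decimal import Decimal, Context, ROUND_HALF_UP, ROUND_HALF_EVEN

biased = Context(prec=8, rounding=ROUND_HALF_UP)
stable = Context(prec=8, rounding=ROUND_HALF_EVEN)

def add_then_subtract(ctx, s, v, times):
    for _ in range(times):
        s = ctx.subtract(ctx.add(s, v), v)    # (s (+) v) (-) v
    return s

u, v = Decimal("1.0000000"), Decimal("0.55555555")
print(add_then_subtract(biased, u, v, 3))     # 1.0000003  (drift)
print(add_then_subtract(stable, u, v, 3))     # 1.0000000  (no drift)

# Theorem D, with the example values above and round to even:
u, v = Decimal("1.2345679"), Decimal("-0.23456785")
w1 = stable.subtract(stable.add(u, v), v)     # (u (+) v) (-) v
w2 = stable.subtract(stable.add(w1, v), v)    # (w1 (+) v) (-) v
print(w1, w2)                                 # 1.2345678 1.2345678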
%folio 300 galley 11 (C) Addison-Wesley 1978
$$\quad Theorem D is valid both for ``round to even'' and ``round
to odd''; how should we choose between these possibilities?
When the radix $b$ is odd, ambiguous cases never arise
except during floating-point division, and the rounding
in such cases is comparatively unimportant. Experience shows
that the best rule for {\sl even} radices is, ``Round to even
when $b/2$ is odd, round to odd when $b/2$ is even.'' The least
significant digit of a floating-point fraction occurs frequently
as a remainder to be rounded off in subsequent calculations,
and this rule avoids generating the digit $b/2$ in the least
significant position whenever possible; its effect is to provide
some memory of an ambiguous rounding so that subsequent rounding
will tend to be unambiguous. For example, if we were to round
to odd in the decimal system, repeated rounding of the number
2.44445 to one less place each time leads to the sequence 2.4445,
2.445, 2.45, 2.5, 3; but if we round to even, such situations
do not occur. [Roy A. Keir, {\sl Inf.\ Proc.\ Letters \bf 3} (1975),
188--189.]
A reader who has checked some of the details of
the above proofs will realize the immense simplification that
has been afforded by the simple rule $u \oplus v =$ round($u
+ v)$. If our floating-point addition routine would fail to
give this result even in a few rare cases, the proofs would
become enormously more complicated and perhaps they would even
break down completely.
Theorem B fails if truncation arithmetic is used
in place of rounding, i.e., if we let $u \oplus v =$ trunc$(u
+ v)$ and $u \ominus v =$ trunc($u - v)$, where trunc($x)$ takes
all positive real $x$ into the largest floating-point number
$≤x$. An exception to Theorem B would then occur for cases such
as $(20, +.10000001) \oplus (10, -.10000001) = (20, +.10000000)$,
when the difference between $u + v$ and $u \oplus v$ cannot
be expressed exactly as a floating-point number; and also for
cases such as $12345678 \oplus .012345678$, when it can be.
Many people feel that, since floating-point arithmetic
is inexact by nature, there is no harm in making it just a little
bit less exact in certain rather rare cases, if it is convenient
to do so. This policy saves a few cents in the design of computer
hardware, or a small percentage of the average running time
of a subroutine. But the above discussion shows that such a
policy is mistaken. We could save about five percent of the
running time of the FADD subroutine, Program 4.2.1A, and about
25 percent of its space, if we took the liberty of rounding
incorrectly in a few cases, but we are much better off leaving
it as it is. The reason is not simply to glorify ``bit chasing'';
a more fundamental issue is at stake here: {\sl Numerical subroutines
should deliver results which satisfy simple, useful mathematical
laws whenever possible.} The crucial formula $u \oplus v =$ round($u + v)$
is a ``regularity'' property which makes a great deal of difference
in whether mathematical analysis of computational algorithms
is worth doing or worth avoiding. Without any underlying symmetry
properties, the job of proving interesting results becomes extremely
unpleasant. {\sl The enjoyment of one's tools is an essential
ingredient of successful work.}
\subsectionbegin{B. Unnormalized floating-point arithmetic}
The policy of
normalizing all floating-point numbers may be construed in two
ways: We may look on it favorably by saying it is an attempt
to get the maximum possible accuracy obtainable with a given
degree of precision, or we may consider it to be potentially
dangerous since it tends to imply that the results are more
accurate than they really are. When we normalize the result
of $(1, +.31428571) \ominus (1, +.31415927)$ to $(-2, +.12644000)$,
we are suppressing information about the possibly greater inaccuracy
of the latter quantity. Such information would be retained if
the answer were left as $(1, +.00012644)$.
The input data to a problem is frequently not known
as precisely as the floating-point representation allows. For
example, the values of Avogadro's number and Planck's constant
are not known to eight significant digits, and it might be
more appropriate to denote them, respectively, by
$$(27, +.00060225)\qquad and\qquad (-23, +.00010545)$$
instead of by $(24, +.60225200)$ and $(-26, +.10545000)$.
It would be nice if we could give our input data for each problem
in an unnormalized form which expresses how much precision is
assumed, and if the output would indicate just how much precision
is known in the answer. Unfortunately, this is a terribly difficult
problem, although the use of unnormalized arithmetic can help
to give some indication. For example, we can say with a fair
degree of certainty that the product of Avogadro's number by
Planck's constant is (0, +.00063507), and their sum is (27,
+.00060225). (The purpose of this example is not to suggest
that any important physical significance should be attached
to the sum and product of these fundamental constants; the point
is that it is possible to preserve a little of the information
about precision in the result of calculation with imprecise
quantities, when the original operands are independent of each
other.)
The rules for unnormalized arithmetic are simply
this: Let $l↓u$ be the number of leading zeros in the fraction
part of $u = (e↓u, f↓u)$, so that $l↓u$ is the largest integer
$≤p$ with $|f↓u| < b↑{-l↓u}$. Then addition and subtraction
are performed just as in Algorithm 4.2.1A, except that all scaling
to the left is suppressed. Multiplication and division are performed
as in Algorithm 4.2.1M, except that the answer is scaled right
or left so that precisely max$(l↓u, l↓v)$ leading zeros appear.
Essentially the same rules have been used in manual calculation
for many years.
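The addition rule is easy to model. The following Python sketch (a simplification with excess $q = 0$, $p = 8$ digits, and ties rounded away from zero, not Algorithm 4.2.1A itself) represents $(e, f)$ by the exponent $e$ and an integer $F$ with $f = F/10↑8$; all scaling to the left is suppressed, and the unnormalized sum of the two constants above comes out as stated:

P = 8                                         # digits per fraction

def round_shift(F, k):
    # shift F right by k digits, rounding to nearest (ties away from zero)
    if k <= 0:
        return F
    q, r = divmod(abs(F), 10 ** k)
    if 2 * r >= 10 ** k:
        q += 1
    return q if F >= 0 else -q

def unnormalized_add(u, v):
    (eu, Fu), (ev, Fv) = u, v
    ew = max(eu, ev)                          # no scaling to the left, cf. (48)
    F = round_shift(Fu, ew - eu) + round_shift(Fv, ew - ev)
    if abs(F) >= 10 ** P:                     # fraction overflow:
        F, ew = round_shift(F, 1), ew + 1     #   scale right once
    return ew, F

avogadro = (27, 60225)                        # (27, +.00060225)
planck = (-23, 10545)                         # (-23, +.00010545)
print(unnormalized_add(avogadro, planck))     # (27, 60225), i.e. (27, +.00060225)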
It follows that, for unnormalized computations,
$$e↓{u\oplus v},\ e↓{u\ominus v} = \hbox{max}(e↓u, e↓v) + (0\hbox{ or }1);\eqno (48)$$
$$e↓{u\otimes v} = e↓u + e↓v - q - \hbox{min}(l↓u, l↓v) - (0\hbox{ or }1);\eqno (49)$$
$$e↓{u\odiv v} = e↓u - e↓v + q - l↓u + l↓v + \hbox{max}(l↓u, l↓v) + (0\hbox{ or }1).\eqno (50)$$
When the result of a calculation is zero, an unnormalized
zero (often called an ``order of magnitude zero'') is given
as the answer; this indicates that the answer may not truly
be zero; we just don't know any of its significant digits.
Error analysis takes a somewhat different form
with unnormalized floating-point arithmetic. Let us define
$$\delta↓u = {1\over 2}b↑{e↓u-q-p}\qquad \hbox{if}\qquad u = (e↓u, f↓u).\eqno (51)$$
This quantity depends on the representation of
$u$, not just on the value $b↑{e↓u-q}f↓u$. Our rounding
rule tells us that
$$|u \oplus v - (u + v)| ≤ \delta↓{u\oplus v},\qquad |u \ominus v - (u - v)| ≤ \delta↓{u\ominus v},$$
$$|u \otimes v - (u \times v)| ≤ \delta↓{u\otimes v},\qquad |u \odiv v - (u/v)| ≤ \delta↓{u\odiv v}.$$
These inequalities apply to normalized as well
as unnormalized arithmetic; the main difference between the
two types of error analysis is the definition of the exponent
of the result of each operation $\biglp$Eqs.\ (48) to (50)$\bigrp$.
We have remarked that the relations $\prec$, $\sim$, $\succ$,
and $\approx$ defined earlier in this section are valid and meaningful
for unnormalized numbers as well as for normalized numbers.
As an example of the use of these relations, let us prove an
approximate associative law for unnormalized addition, analogous
to (39):
$$(u \oplus v) \oplus w \approx u \oplus (v \oplus w)\qquad (ε),\eqno (52)$$
for suitable $ε$. We have
$$|(u \oplus v) \oplus w - (u + v + w)| ≤ |(u \oplus v) \oplus w - \biglp (u \oplus v) + w\bigrp | + |u \oplus v - (u + v)|$$
$$≤ \delta↓{(u\oplus v)\oplus w} + \delta↓{u\oplus v} ≤ 2\delta↓{(u\oplus v)\oplus w}.$$
A similar formula holds for $|u \oplus (v \oplus w)
- (u + v + w)|$. Now since $e↓{(u\oplus v)\oplus w} =$ max$(e↓u,
e↓v, e↓w) + (0$, 1, or 2), we have $\delta↓{(u\oplus v)\oplus
w} ≤ b↑2\delta↓{u\oplus (v\oplus w)}$. Therefore we find
that (52) is valid when $ε ≥ 2b↑{2-p}$; unnormalized addition
is not as erratic as normalized addition with respect to the
associative law.
%folio 306 galley 12 (C) Addison-Wesley 1978
\quad It should be emphasized that unnormalized arithmetic
is by no means a panacea; there are examples where it indicates
greater accuracy than is present (e.g., adding up a great many
small quantities of about the same magnitude, or finding the
$n$th power of a number for large $n$), and there are many more
examples where it indicates poor accuracy while normalized arithmetic
actually does produce good results. There is an important reason
why no straightforward one-operation-at-a-time method of error
analysis is really reliable, namely the fact that operands are
usually not independent of each other. This means that errors
tend to cancel or reinforce each other in strange ways. For
example, suppose that $x$ is approximately ${1\over 2}$, and
suppose that we have an approximation $y = x + \delta$ with
absolute error $\delta $. If we now wish to compute $x(1 - x)$,
we can form $y(1 - y)$; if $x = {1\over 2} + ε$ we find $y(1
- y) = x(1 - x) - 2ε\delta - \delta ↑2$: The error has decreased
by a factor of 2$ε + \delta $. This is just one case where multiplication
of imprecise quantities can lead to a quite accurate result
when the operands are not independent of each other. A more
obvious example is the computation of $x \ominus x$, which can
be obtained with perfect accuracy regardless of how bad an approximation
to $x$ we begin with.
The extra information which unnormalized arithmetic
gives us can often be more important than the information it
destroys during an extended calculation, but (as usual) we must
use it with care. Examples of the proper use of unnormalized
arithmetic are discussed by R. L. Ashenhurst and N. Metropolis,
in {\sl Computers and Computing, AMM} Slaught Memorial Papers,
{\bf 10} (February, 1965), 47--59; and by R. L. Ashenhurst,
in {\sl Error in Digital Computation \bf 2}, ed.\ by L. B.
Rall (New York: Wiley, 1965), 3--37. Appropriate methods for
computing standard mathematical functions with both input and
output in unnormalized form are given by R. L. Ashenhurst in
{\sl JACM \bf 11} (1964), 168--187.
\subsectionbegin{C. Interval arithmetic} Another approach
to the problem of error determination is the so-called ``interval''
or ``range'' arithmetic, in which upper and lower bounds on
each number are maintained during the calculations. Thus, for
example, if we know that $u↓0 ≤ u ≤ u↓1$ and $v↓0 ≤ v ≤ v↓1$,
we represent this by the interval notation $u = [u↓0, u↓1]$,
$v = [v↓0, v↓1]$. The sum $u \oplus v$ is $[u↓0 \underline\oplus v↓0,\
u↓1 \overline\oplus v↓1]$, where $\underline\oplus$ denotes ``lower floating-point addition,''
the greatest representable number less than or equal to the
true sum, and $\overline\oplus$ is defined similarly (see exercise 4.2.1--13).
Furthermore $u \ominus v = [u↓0 \underline\ominus v↓1,\ u↓1 \overline\ominus v↓0]$;
and if $u↓0$ and $v↓0$ are positive, we have $u \otimes v =
[u↓0 \underline\otimes v↓0,\ u↓1 \overline\otimes v↓1]$, $u \odiv v = [u↓0 \underline\odiv v↓1,\
u↓1 \overline\odiv v↓0]$. For example, we might represent Avogadro's
number and Planck's constant as
$$N = [(24, +.60222400),\ (24, +.60228000)],\qquad h = [(-26, +.10544300),\ (-26, +.10545700)];$$
their sum and product would then turn out to be
$$N \oplus h = [(24, +.60222400),\ (24, +.60228001)],\qquad N \otimes h = [(-3, +.63500305),\ (-3, +.63514642)].$$
\quad If we try to divide by $[v↓0, v↓1]$
when $v↓0 < 0 < v↓1$, there is a possibility of division by
zero. Since the philosophy underlying interval arithmetic is
to provide rigorous error estimates, a divide-by-zero error
should be signalled in this case. However, overflow and underflow
need not be treated as errors in interval arithmetic, if special
conventions are introduced as discussed in exercise 24.
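A workable model of these interval operations needs only an arithmetic with directed rounding. The following Python sketch uses two decimal contexts, one rounding toward $-∞$ for lower bounds and one toward $+∞$ for upper bounds; with the eight-digit endpoints above it reproduces the bounds just quoted for $N \oplus h$ and $N \otimes h$ (the multiplication and division rules shown assume positive lower bounds, as in the text):

from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING, DivisionByZero

lo = Context(prec=8, rounding=ROUND_FLOOR)    # "lower" operations
hi = Context(prec=8, rounding=ROUND_CEILING)  # "upper" operations

def iadd(u, v): return (lo.add(u[0], v[0]), hi.add(u[1], v[1]))
def isub(u, v): return (lo.subtract(u[0], v[1]), hi.subtract(u[1], v[0]))
def imul(u, v): return (lo.multiply(u[0], v[0]), hi.multiply(u[1], v[1]))

def idiv(u, v):
    if v[0] <= 0 <= v[1]:                     # divisor interval contains zero:
        raise DivisionByZero("interval divisor contains zero")
    return (lo.divide(u[0], v[1]), hi.divide(u[1], v[0]))

N = (Decimal("6.0222400E23"), Decimal("6.0228000E23"))
h = (Decimal("1.0544300E-27"), Decimal("1.0545700E-27"))
print(iadd(N, h))    # bounds 6.0222400E+23 and 6.0228001E+23
print(imul(N, h))    # bounds 0.00063500305 and 0.00063514642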
Interval arithmetic takes only about twice as long
as ordinary arithmetic, and it provides truly reliable error
estimates. Considering the difficulty of mathematical error
analyses, this is indeed a small price to pay. Since the intermediate
values in a calculation often depend on each other, as explained
above, the final estimates obtained with interval arithmetic
will tend to be pessimistic; and iterative numerical methods
often have to be redesigned if we want to deal with intervals.
The prospects for effective use of interval arithmetic look
very good, however, and efforts should be made to increase its
availability.
\subsectionbegin{D. History and bibliography} Jules
Tannery's classic treatise on decimal calculations, {\sl Le\c cons
d'Arithm\'etique} (Paris: Colin, 1894), stated that
positive numbers should be rounded upwards if the first discarded
digit is 5 or more; since exactly half of the decimal digits
are 5 or more, he felt that this rule rounds exactly half the
time and produces compensating errors. The idea of ``round to
even'' in the ambiguous cases seems to have been mentioned first
by James B. Scarborough in the first edition of his pioneering
book {\sl Numerical Mathematical Analysis} (Baltimore: Johns
Hopkins Press, 1930), p.\ 2; in the second (1950) edition he
amplified his earlier remarks, stating that ``It should be obvious
to any thinking person that when a 5 is cut off, the preceding
digit should be increased by 1 in only {\sl half} the cases,''
and he recommended round-to-even in order to achieve this.
The first analysis of floating-point arithmetic
was given by F. L. Bauer and K. Samelson, {\sl Zeitschrift f\"ur
angewandte Math.\ und Physik \bf 5} (1953), 312--316. The
next publication was not until over five years later: J. W.
Carr III, {\sl CACM \bf 2}, 5 (May, 1959), 10--15. See also
P. C. Fischer, {\sl Proc.\ ACM Nat.\ Meeting \bf 13} (Urbana,
Illinois, 1958), paper 39. The book {\sl Rounding Errors in
Algebraic Processes} (Englewood Cliffs: Prentice-Hall, 1963),
by J. H. Wilkinson, shows how to apply error analysis of the
individual arithmetic operations to the error analysis of large-scale
problems; see also his treatise on {\sl The Algebraic Eigenvalue
Problem} (Oxford: Clarendon Press, 1965).
The relations $\prec$, $\sim$, $\succ$, $\approx$ introduced in
this section are similar to ideas published by A. van Wijngaarden
in {\sl BIT \bf 6} (1966), 66--81. Theorems A and B above were
inspired by some related work of Ole M\o ller, {\sl BIT
\bf 5} (1965), 37--50, 251--255. Theorem C is due to T. J. Dekker,
{\sl Numer.\ Math.\ \bf 18} (1971), 224--242; extensions and refinements
of all three theorems have been published by S. Linnainmaa,
{\sl BIT \bf 14} (1974), 167--202. W. M. Kahan introduced
Theorem D in some unpublished notes; for a complete proof and
further commentary see J. F. Reiser and D. E. Knuth, {\sl Inf.\
Proc.\ Letters \bf 3} (1975), 84--87, 164.
Unnormalized floating-point arithmetic was recommended
by F. L. Bauer and K. Samelson in the article cited above, and
it was independently used by J. W. Carr III at the University
of Michigan in 1953. Several years later, the MANIAC III computer
was designed to include both kinds of arithmetic in its hardware;
see R. L. Ashenhurst and N. Metropolis, {\sl JACM \bf 6} (1959),
415--428; {\sl IEEE Transactions \bf EC-12}
(1963), 896--901; R. L. Ashenhurst, {\sl Proc.\ Spring Joint
Computer Conf.\ \bf 21} (1962), 195--202. See also H. L. Gray
and C. Harrison, Jr., {\sl Proc.\ Eastern Joint Computer Conf.\
\bf 16} (1959), 244--248, and W. G. Wadey, {\sl JACM \bf 7}
(1960), 129--139, for further early discussions of unnormalized
arithmetic.
For early developments in interval arithmetic,
and some modifications, see A. Gibb, {\sl CACM \bf 4} (1961),
319--320; B. A. Chartres, {\sl JACM \bf 13} (1966), 386--403;
and the book {\sl Interval Analysis} by Ramon E. Moore (Prentice-Hall,
1966).
More recent work on floating-point accuracy is
summarized in two important papers which can be especially recommended
for further study: W. M. Kahan, {\sl Proc.\ IFIP Congress (1971),
\bf 2,} 1214--1239; R. P. Brent, {\sl IEEE Transactions} {\bf C-22}
(1973), 601--607. Both papers include useful theory and demonstrate
that it pays off in practice.
{\:r EXERCISES}

(Normalized floating-point arithmetic is assumed unless the contrary is specified.)
\exno 1. [M18] Prove that identity (7) is a consequence of (2)
through (6).
\exno 2. [M20] Use identities (2) through (8) to prove that
$(u \oplus x) \oplus (v \oplus y) ≥ u \oplus v$ whenever $x
≥ 0$ and $y ≥ 0$.
\exno 3. [M20] Find eight-digit floating-decimal numbers $u,
v$, and $w$ such that
$$$u \otimes (v \otimes w) ≠ (u \otimes v) \otimes w$,
|qctr \9skip and no exponent overflow or underflow occurs during
any of these computations.
\exno 4. [10] Is it possible to have floating-point numbers
$u, v$, and $w$ for which exponent overflow occurs during the
calculation of $u \otimes (v \otimes w)$ but not during the
calculation of $(u \otimes v) \otimes w$?
\exno 6. [M22] Is either of the following two identities valid
for all floating-point numbers $u$?\xskip (a) $0 \ominus (0 \ominus
u) = u$.\xskip (b) $1 \odiv (1 \odiv u) = u$.
\exno 7. [M21] Let $u↑{(2)}$ stand for $u \otimes u$. Find
floating-binary numbers $u$ and $v$ such that $2(u↑{(2)} \oplus
v↑{(2)}) < (u \oplus v)↑{(2)}$.
\exno 8. [20] Let $ε = 0.0001$; which of the relations $u \prec
v\ (ε)$, $u \sim v\ (ε)$, $u \approx v\ (ε)$ hold for the following pairs
of base 10, excess 0, eight-digit floating-point numbers?
\qquad a) $u = ( 1, +.31415927)$, $v = ( 1, +.31416000)$;
\qquad b) $u = ( 0, +.99997000)$, $v = ( 1, +.10000039)$;
\qquad c) $u = (24, +.60225200)$, $v = (27, +.00060225)$;
\qquad d) $u = (24, +.60225200)$, $v = (31, +.00000006)$;
\qquad e) $u = (24, +.60225200)$, $v = (32, +.00000000)$.
\exno 9. [M22] Prove (33) and explain why the conclusion cannot
be strengthened to $u \approx w\ (ε↓1 + ε↓2)$.
%folio 313 galley 13 (C) Addison-Wesley 1978
\exno 10. [M25] (W. M. Kahan.)\xskip A certain computer
performs floating-point arithmetic without proper rounding,
and, in fact, its floating-point multiplication routine ignores
all but the first $p$ most significant digits of the $2p$-digit
product $f↓uf↓v$. (Thus when $f↓uf↓v < 1/b$, the least-significant
digit of $u \otimes v$ always comes out to be zero, due to subsequent
normalization.) Show that this causes the monotonicity of multiplication
to fail; i.e., there are positive normalized floating-point
numbers $u, v, w$ such that $u < v$ but $u \otimes w > v \otimes
w$.
\exno 11. [M20] Prove Lemma T.
\exno 12. [M24] Carry out the proof of Theorem B and (46) when
$|e↓u - e↓v| ≥ p$.
\exno 13. [M25] Some programming languages (and even some computers)
make use of floating-point arithmetic only, with no provision
for exact calculations with integers. If operations on integers
are desired, we can, of course, represent an integer as a floating-point
number; and when the floating-point operations satisfy the basic
definitions in (9), we know that all floating-point operations
will be exact,
\exno 17. [28] Write a \MIX\ subroutine, FCMP, which compares the floating-point
number $u$ in location ACC with the floating-point number $v$
in register A, and which sets the comparison indicator to LESS,
EQUAL, or GREATER, according as $u \prec v$, $u \sim v$, or $u \succ v\
(ε)$; here $ε$ is stored in location EPSILON as a nonnegative
fixed-point quantity with the decimal point assumed at the left
of the word. Assume normalized inputs.
\exno 18. [M40] In unnormalized arithmetic is there a suitable
number $ε$ such that
$$$u \otimes (v \oplus w) \approx (u \otimes v) \oplus (u \otimes
w)\qquad (ε)?$
\exno 19. [M30] (W. M. Kahan.)\xskip Consider
the following procedure for floating-point summation of $x↓1,
\ldotss , x↓n$:
$$s↓0 = c↓0 = 0;\qquad y↓k = x↓k \ominus c↓{k-1},\quad s↓k = s↓{k-1}
\oplus y↓k,\quad c↓k = (s↓k \ominus s↓{k-1}) \ominus y↓k,\qquad
\hbox{for}\quad 1 ≤ k ≤ n.$$
Let the relative errors in these operations be defined by the equations
$$y↓k = (x↓k - c↓{k-1})(1 + \eta↓k),\quad s↓k = (s↓{k-1} + y↓k)(1 + \sigma↓k),\quad
c↓k = \biglp (s↓k - s↓{k-1})(1 + \gamma↓k) - y↓k\bigrp (1 + \delta↓k),$$
|qctr |newtype \9skip where $|\gamma ↓k|, |\eta ↓k|, |\sigma
↓k| ≤ ε$. Prove that $s↓n = \sum ↓{1≤k≤n} (1 + \theta ↓k)x↓k$,
where $|\theta ↓k| ≤ 2ε + O(nε↑2)$. [Theorem C says that if
$b = 2$ and $|s↓{k-1}| ≥ |y↓k|$ we have $s↓{k-1} + y↓k = s↓k
- c↓k$ exactly. But in this exercise we want to obtain an estimate
which is valid {\sl even when floating-point operations are
not carefully rounded,} assuming only that each operation has
bounded relative error.]
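Readers who wish to experiment can write the recurrence of this exercise directly in ordinary double-precision binary arithmetic; the following Python sketch shows the idea, with the corrections $c↓k$ recovering low-order digits that a straightforward summation throws away:

# Compensated summation: s_k and c_k as defined in exercise 19.
def compensated_sum(xs):
    s = c = 0.0
    for x in xs:
        y = x - c                  # y_k = x_k (-) c_{k-1}
        t = s + y                  # s_k = s_{k-1} (+) y_k
        c = (t - s) - y            # c_k = (s_k (-) s_{k-1}) (-) y_k
        s = t
    return s

xs = [0.1] * 10
print(sum(xs))                     # 0.9999999999999999 (plain summation)
print(compensated_sum(xs))         # 1.0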
\exno 20. [25] (S. Linnainmaa.)\xskip Find all $u, v$ for which $|u|
≥ |v|$ and (47) fails.
\exno 21. [M35] (T. J. Dekker.)\xskip Theorem C shows how to do exact
addition of floating-binary numbers. Explain how to do {\sl
exact multiplication:} Express the product $uv$ in the form
$w + w↑\prime $, where $w$ and $w↑\prime$ are computed from
two given floating-binary numbers $u$ and $v$, using only the
operations \oplus , \ominus , and \otimes .
\exno 22. [M30] Can drift occur in floating-point multiplication/division?
Consider the sequence $x↓0 = u, x↓{2n+1} = x↓{2n} \otimes v,
x↓{2n+2} = x↓{2n+1} \odiv v$, given $u$ and $v$; what is the
largest $k$ such that $x↓k ≠ x↓{k+2}$ is possible?
\exno 23. [M26] Prove or disprove: $u \ominus (u$ mod $1) = \lfloor
u\rfloor$ for all floating-point $u$.
\exno 24. [M27] (W. M. Kahan.)\xskip Define appropriate arithmetic
operations on intervals [$a, b]$, where $a$ and $b$ are either
nonzero floating-point numbers or the special symbols +0, -0,
+∞, -∞; furthermore $a ≤ b$, and $a = b$ implies that $a$ is
finite and nonzero. This interval stands for all floating-point
$x$ such that $a ≤ x ≤ b$, where we regard $-∞ < -u < -0 <
+0 < +u < +∞$ for all positive $u$. (Thus, $[1, 2]$ means $1 ≤
x ≤ 2$; $[+0, 1]$ means $0 < x ≤ 1$; $[-0, 1]$ means $0 ≤ x ≤ 1$;
$[-0, +0]$ denotes the single value 0; and $[-∞, +∞]$ stands for
everything.)
\exno 25. [15] When people speak about inaccuracy in floating-point
arithmetic they often ascribe errors to ``cancellation'' which
occurs during the subtraction of nearly equal quantities. But
when $u$ and $v$ are approximately equal, the difference $u
\ominus v$ is obtained exactly, with no error. What do these
people really mean?
\exno 26. [M25] (W. M. Kahan.)\xskip Let $f(x) = 1 + x + x↑2 + \cdots
+ x↑{106} = (1 - x↑{107})/(1 - x)$ for $x < 1$, and let $g(y)
= f\biglp ({1\over 3} - y↑2)(3 + 3.45y↑2)\bigrp$
for $0 < y < 1$. Evaluate $g(y)$ on one or more ``pocket calculators,''
for $y = 10↑{-3}, 10↑{-4}, 10↑{-5}, 10↑{-6}$, and explain all
inaccuracies in the results you obtain. [Since present-day calculators
do not round correctly, the results are often surprising. Note
that $g(ε) = 107 - 10491.35ε↑2 + O(ε↑4)$.]
%folio 317 galley 14 (C) Addison-Wesley 1978
{\:r 4.2.3. Double-Precision Calculations}

Up to now we have considered ``single-precision''
floating-point arithmetic, which essentially means that the
floating-point values we have dealt with can be stored in a
single machine word. When single-precision floating-point arithmetic
does not yield sufficient accuracy for a given application,
the precision can be increased by suitable programming techniques
which use two or more words of memory to represent each number.
Although we shall discuss the general question of high-precision
calculations in Section 4.3, it is appropriate to give a separate
discussion of double-precision here; special techniques apply
to double precision which are comparatively inappropriate for
higher precisions, and double precision is a reasonably important
topic in its own right, since it is the first step beyond single
precision and it is applicable to many problems which do not require
extremely high precision.
Double-precision calculations are almost always
required for floating-point, rather than fixed-point arithmetic,
except perhaps in statistical work where fixed-point double-precision
arithmetic is commonly used to calculate sums of squares and
cross products; since fixed-point versions of double-precision
arithmetic are simpler than floating-point versions, we will
confine our discussion here to the latter.
Double precision is quite frequently desired not
only to extend the precision of the fraction parts of floating-point
numbers, but also to increase the range of the exponent part.
Thus we shall deal in this section with the following two-word
format for double-precision floating-point numbers in the \MIX\
computer:
$$\pm\quad e\quad e\quad f\quad f\quad f\qquad\qquad f\quad f\quad f\quad f\quad f\quad .\eqno (1)$$
Here two bytes are used for the exponent and eight
bytes are used for the fraction. The exponent is ``excess $b↑2/2$,''
where $b$ is the byte size. The sign will appear in the most
significant word; it is convenient to ignore the sign of the
other word completely.
Our discussion of double-precision arithmetic will
be rather machine-oriented, because it is only by studying the
problems involved in coding these routines that a person can
properly appreciate the subject. A careful study of the \MIX\
programs below is therefore essential to the understanding of
this section.
In this section we shall depart from the idealistic
goals of accuracy stated in the previous two sections; our double-precision
routines will {\sl not} round their results, and a little bit
of error will sometimes be allowed to creep in. Users dare not
trust these routines too much. There was ample reason to squeeze
out every possible drop of accuracy in the single-precision
case, but now we face a different situation:\xskip (a) The extra programming
required to ensure true double-precision rounding in all cases
is considerable; fully accurate routines would take, say, twice
as much space and half again as much more time. It was comparatively
easy to make our single-precision routines perfect, but double
precision brings us face to face with our machine's limitations.
[A similar situation occurs with respect to other floating-point
subroutines; we can't expect the routine to compute round(cos
$x)$ exactly for all $x$, since that turns out to be virtually
impossible. Instead, the cosine routine should provide the best
relative error it can achieve with reasonable speed, for all
reasonable values of $x$; and the designer of the routine should
try to make the computed function satisfy simple mathematical
laws whenever possible (e.g., $\cos (-x) = \cos x$, $|\cos x|
≤ 1$, $\cos x ≥ \cos y$ for $0 ≤ x ≤ y < π$).]\xskip (b) Single-precision
arithmetic is a ``staple food'' that everybody who wants to
employ floating-point arithmetic must use, but double precision
is usually for situations where such clean results aren't as
important. The difference between seven- and eight-place accuracy
can be noticeable, but we rarely care about the difference between
15- and 16-place accuracy. Double precision is most often used
for intermediate steps during the calculation of single-precision
results; its full potential isn't needed.\xskip (c) It will be instructive
for us to analyze these procedures in order to see how inaccurate
they can be, since they typify the types of shortcuts generally
taken in bad single-precision routines (see exercises 7 and
8).
Let us now consider addition and subtraction operations
from this standpoint. Double-precision addition differs from
the single-precision case in the following respects: The addition
is performed by separately adding together the least-significant
halves and the most-significant halves, taking care of carries
appropriately. But since we are doing signed-magnitude arithmetic,
it is possible to add the least-significant halves and to get
the wrong sign (namely, when the signs of the operands are opposite,
and the least-significant half of the smaller operand is bigger
than the least-significant half of the larger operand); so it
is helpful to know what sign to expect. Therefore in step A2
(cf.\ Algorithm 4.2.1A), we not only assume that $e↓u ≥ e↓v$,
we also assume that |$u| ≥ |v|$. This means we can be sure that
the final sign will be the sign of $u$. In other respects, the
double-precision addition is very much like the single-precision
routine, only everything is done twice.
\algbegin Program A (Double-precision addition).
The subroutine DFADD adds a double-precision floating-point
number $v$, having the form (1), to a double-precision floating-point
number $u$, assuming that $v$ is initially in rAX (i.e., registers
A and X), and that $u$ is initially stored in locations ACC
and ACCX. The answer appears both in rAX and in (ACC, ACCX).
The subroutine DFSUB subtracts $v$ from $u$ under the same conventions.
Both input operands are assumed to be normalized, and the answer
is normalized. The last portion of this program is a double-precision
normalization procedure which is used by other subroutines of
this section. Exercise 5 shows how to improve this program significantly.
{\sl 01}⊗ABS⊗EQU⊗1:5⊗Field definition for absolute value\cr
{\sl 02}⊗SIGN⊗EQU⊗0:0⊗Field definition for sign\cr
{\sl 03}⊗EXPD⊗EQU⊗1:2⊗Double-precision exponent field\cr
{\sl 04}⊗DFSUB⊗STA⊗TEMP⊗Double-precision subtraction:\cr
{\sl 05}⊗⊗LDAN⊗TEMP⊗Change sign of $v$.\cr
{\sl 06}⊗DFADD⊗STJ⊗EXITDF⊗Double-precision addition:\cr
{\sl 07}⊗⊗CMPA⊗ACC(ABS)⊗Compare $|u|$ with $|v|$.\cr
{\sl 08}⊗⊗JG⊗1F\cr
{\sl 09}⊗⊗JL⊗2F\cr
{\sl 10}⊗⊗CMPX⊗ACCX(ABS)\cr
{\sl 11}⊗⊗JLE⊗2F\cr
{\sl 12}⊗1H⊗STA⊗ARG⊗If $|u| < |v|$, interchange $u ↔ v$.\cr
{\sl 13}⊗⊗STX⊗ARGX\cr
{\sl 14}⊗⊗LDA⊗ACC\cr
{\sl 15}⊗⊗LDX⊗ACCX\cr
{\sl 16}⊗⊗ENT1⊗ACC⊗(ACC and ACCX are in consecutive\cr
{\sl 17}⊗⊗MOVE⊗ARG(2)⊗\qquad locations.)\cr
{\sl 18}⊗2H⊗STA⊗TEMP⊗Now ACC has the sign of the answer.\cr
{\sl 19}⊗⊗LD1N⊗TEMP(EXPD)⊗rI1 ← $-e↓v$.\cr
{\sl 20}⊗⊗LD2⊗ACC(EXPD)⊗rI2 ← $e↓u$.\cr
{\sl 21}⊗⊗INC1⊗0,2⊗rI1 ← $e↓u - e↓v$.\cr
{\sl 22}⊗⊗SLAX⊗2⊗Remove exponent.\cr
{\sl 23}⊗⊗SRAX⊗1,1⊗Scale right.\cr
{\sl 24}⊗⊗STA⊗ARG⊗$0\ v↓1 v↓2 v↓3 v↓4$\cr
{\sl 25}⊗⊗STX⊗ARGX⊗$v↓5 v↓6 v↓7 v↓8 v↓9$\cr
{\sl 26}⊗⊗STA⊗ARGX(SIGN)⊗Store true sign in both halves.\cr
{\sl 27}⊗⊗LDA⊗ACC\cr
{\sl 28}⊗⊗LDX⊗ACCX\cr
{\sl 29}⊗⊗SLAX⊗2⊗Remove exponent.\cr
{\sl 30}⊗⊗STA⊗ACC⊗$u↓1 u↓2 u↓3 u↓4 u↓5$\cr
{\sl 31}⊗⊗SLAX⊗4\cr
{\sl 32}⊗⊗ENTX⊗1\cr
{\sl 33}⊗⊗STX⊗EXPO⊗EXPO ← 1 (see below).\cr
{\sl 34}⊗⊗SRC⊗1⊗$1\ u↓5 u↓6 u↓7 u↓8$\cr
{\sl 35}⊗⊗STA⊗1F(SIGN)⊗A trick, see comments in text.\cr
{\sl 36}⊗⊗ADD⊗ARGX(0:4)⊗Add $0\ v↓5 v↓6 v↓7 v↓8$.\cr
{\sl 37}⊗⊗SRAX⊗4\cr
{\sl 38}⊗1H⊗DECA⊗1⊗Recover from inserted 1. (Sign varies.)\cr
{\sl 39}⊗⊗ADD⊗ACC(0:4)⊗Add most significant halves.\cr
{\sl 40}⊗⊗ADD⊗ARG⊗(Overflow cannot occur.)\cr
{\sl 41}⊗DNORM⊗JANZ⊗1F⊗Normalization routine:\cr
{\sl 42}⊗⊗JXNZ⊗1F⊗$f↓w$ in rAX, $e↓w =$ EXPO + rI2.\cr
{\sl 43}⊗DZERO⊗STA⊗ACC⊗If $f↓w = 0$, set $e↓w ← 0$.\cr
{\sl 44}⊗⊗JMP⊗9F\cr
{\sl 45}⊗2H⊗SLAX⊗1⊗Normalize to left.\cr
{\sl 46}⊗⊗DEC2⊗1\cr
{\sl 47}⊗1H⊗CMPA⊗=0=(1:1)⊗Is the leading byte zero?\cr
{\sl 48}⊗⊗JE⊗2B\cr
{\sl 49}⊗⊗SRAX⊗2⊗(Rounding omitted.)\cr
{\sl 50}⊗⊗STA⊗ACC\cr
{\sl 51}⊗⊗LDA⊗EXPO⊗Compute final exponent.\cr
{\sl 52}⊗⊗INCA⊗0,2\cr
{\sl 53}⊗⊗JAN⊗EXPUND⊗Is it negative?\cr
{\sl 54}⊗⊗STA⊗ACC(EXPD)\cr
{\sl 55}⊗⊗CMPA⊗=1(3:3)=⊗Is it more than two bytes?\cr
{\sl 56}⊗⊗JL⊗8F\cr
{\sl 57}⊗EXPOVD⊗HLT⊗20\cr
{\sl 58}⊗EXPUND⊗HLT⊗10\cr
{\sl 59}⊗8H⊗LDA⊗ACC⊗Bring answer into rA.\cr
{\sl 60}⊗9H⊗STX⊗ACCX\cr
{\sl 61}⊗EXITDF⊗JMP⊗*⊗Exit from subroutine.\cr
{\sl 62}⊗ARG⊗CON⊗0\cr
{\sl 63}⊗ARGX⊗CON⊗0\cr
{\sl 64}⊗ACC⊗CON⊗0⊗Floating-point accumulator\cr
{\sl 65}⊗ACCX⊗CON⊗0\cr
{\sl 66}⊗EXPO⊗CON⊗0⊗Part of ``raw exponent''\cr
%folio 322 galley 15-16 WARNING: End of 15 and beginning of 16 were unreadable. (C) Addison-Wesley 1978
\quad When the least-significant halves are added together
in this program, an extra digit ``1'' is inserted at the left
of the word which is known to have the correct sign. After the
addition, this byte can be 0, 1, or 2, depending on the circumstances,
and all three cases are handled simultaneously in this way.
(Compare this with the rather cumbersome method of complementation
which is used in Program 4.2.1A.)
It is worth noting that register A can be zero
after the instruction on line 40 has been performed; and, because
of the way \MIX\ defines the sign of a zero result, the accumulator
contains the correct sign which is to be attached to the result
if register X is nonzero. If lines 39 and 40 were interchanged,
the program would be incorrect (although both instructions are
``ADD'')!
Now let us consider double-precision multiplication.
The product has four components, shown schematically in Fig.\
4. Since we need only the leftmost eight bytes, it is convenient
to work only with the digits to the left of the vertical line
in the diagram, and this means in particular that we need not
even compute the product of the two least significant halves.
{\bf Fig.\ 4.}\xskip Double-precision multiplication of eight-byte fraction parts.
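In other words, only three of the four partial products are ever formed. The following Python sketch (an integer model of the scheme in Fig.\ 4, not Program M itself) shows what is computed and bounds the amount that is lost:

B = 64                             # assumed byte size
HALF = B ** 4                      # four bytes

def dp_fraction_multiply(u, v):
    # u and v are eight-byte fractions scaled to integers, 0 <= u, v < B**8,
    # representing u/B**8 and v/B**8; the low*low product is never formed,
    # and the two cross products are truncated at the vertical line.
    um, ul = divmod(u, HALF)
    vm, vl = divmod(v, HALF)
    return um * vm + (um * vl) // HALF + (ul * vm) // HALF

u = v = B ** 8 // 3                # two fractions of about 1/3
approx = dp_fraction_multiply(u, v)
exact = (u * v) // B ** 8          # true leading eight bytes of the product
print(exact - approx)              # 0, 1, or 2: the error in the last byte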
\algbegin Program M (Double-precision
multiplication). The input and output conventions for
this subroutine are the same as for Program A.
{\sl 01}⊗BYTE⊗EQU⊗1(4:4)⊗Byte size\cr
{\sl 02}⊗QQ⊗EQU⊗BYTE*BYTE/2⊗Excess of double-precision exponent\cr
{\sl 03}⊗DFMUL⊗STJ⊗EXITDF⊗Double-precision multiplication:\cr
{\sl 04}⊗⊗STA⊗TEMP\cr
{\sl 05}⊗⊗SLAX⊗2⊗Remove exponent.\cr
{\sl 06}⊗⊗STA⊗ARG⊗$v↓m$\cr
{\sl 07}⊗⊗STX⊗ARGX⊗$v↓l$\cr
{\sl 08}⊗⊗LDA⊗TEMP(EXPD)\cr
{\sl 09}⊗⊗ADD⊗ACC(EXPD)\cr
{\sl 10}⊗⊗STA⊗EXPO⊗EXPO ← $e↓u + e↓v$.\cr
{\sl 11}⊗⊗ENT2⊗-QQ⊗rI2 ← $-$QQ.\cr
{\sl 12}⊗⊗LDA⊗ACC\cr
{\sl 13}⊗⊗LDX⊗ACCX\cr
{\sl 14}⊗⊗SLAX⊗2⊗Remove exponent.\cr
{\sl 15}⊗⊗STA⊗ACC⊗$u↓m$\cr
{\sl 16}⊗⊗STX⊗ACCX⊗$u↓l$\cr
{\sl 17}⊗⊗MUL⊗ARGX⊗$u↓m \times v↓l$\cr
{\sl 18}⊗⊗STA⊗TEMP\cr
{\sl 19}⊗⊗LDA⊗ARG(ABS)\cr
{\sl 20}⊗⊗MUL⊗ACCX(ABS)⊗$|v↓m \times u↓l|$\cr
{\sl 21}⊗⊗SRA⊗1⊗$0\ x\ x\ x\ x$\cr
{\sl 22}⊗⊗ADD⊗TEMP(1:4)⊗(Overflow cannot occur.)\cr
{\sl 23}⊗⊗STA⊗TEMP\cr
{\sl 24}⊗⊗LDA⊗ARG\cr
{\sl 25}⊗⊗MUL⊗ACC⊗$v↓m \times u↓m$\cr
{\sl 26}⊗⊗STA⊗TEMP(SIGN)⊗True sign of result.\cr
{\sl 27}⊗⊗STA⊗ACC⊗Now prepare to add all the\cr
{\sl 28}⊗⊗STX⊗ACCX⊗\qquad partial products together.\cr
{\sl 29}⊗⊗LDA⊗ACCX(0:4)⊗$0\ x\ x\ x\ x$\cr
{\sl 30}⊗⊗ADD⊗TEMP⊗(Overflow cannot occur.)\cr
{\sl 31}⊗⊗SRAX⊗4\cr
{\sl 32}⊗⊗ADD⊗ACC⊗(Overflow cannot occur.)\cr
{\sl 33}⊗⊗JMP⊗DNORM⊗Normalize and exit.\cr
\quad Note the careful treatment of signs in
this program, and note also the fact that the range of exponents
makes it impossible to compute the final exponent using an index
register. Program M is perhaps a little too slipshod in accuracy,
since it throws away all the information to the right of the
vertical line in Fig.\ 4 and this can make the least significant
byte as much as 2 in error. A little more accuracy can be achieved
as discussed in exercise 4.
|qleft \12skip \quad Double-precision floating division is the
most difficult routine, or at least the most frightening prospect
we have encountered so far in this chapter. Actually, it is
not terribly complicated, once we see how to do it; let us write
the numbers to be divided in the form $(u↓m + εu↓l)/(v↓m + εv↓l)$,
where $ε$ is the reciprocal of the word size of the computer,
and where $v↓m$ is assumed to be normalized. The fraction can
now be expanded as follows:
$${u↓m + εu↓l\over v↓m + εv↓l} = {u↓m + εu↓l\over v↓m}\left({1\over 1 + ε(v↓l/v↓m)}\right)
= {u↓m + εu↓l\over v↓m}\left(1 - ε\left({v↓l\over v↓m}\right) + ε↑2\left({v↓l\over v↓m}\right)↑2 - \cdots\right).\eqno (2)$$
Since $0 ≤ |v↓l| < 1$ and $1/b ≤ |v↓m| < 1$, we have
$|v↓l/v↓m| < b$, and the error from dropping terms in $ε↑2$
can be disregarded. Our method therefore is to compute $w↓m
+ εw↓l = (u↓m + εu↓l)/v↓m$, and then to subtract $ε$ times $w↓mv↓l/v↓m$
from the result.
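The procedure can be modelled numerically, with $ε$ playing the role of the reciprocal of the word size; the following Python sketch is only an illustration of formula (2) (exercise 1 below carries out the same steps by hand with $ε = 1/1000$):

def dp_divide(um, ul, vm, vl, eps):
    q = (um + eps * ul) / vm                 # first approximation (u_m + eps*u_l)/v_m
    wm = round(q / eps) * eps                # its "upper half" w_m
    wl = q - wm                              # and "lower half" w_l
    return wm + wl - eps * wm * (vl / vm)    # subtract eps * w_m * v_l / v_m

eps = 1e-3
print(dp_divide(.180, .000, .314, .159, eps))   # about 0.5729583
print(.180 / (.314 + eps * .159))               # 0.5729581..., the true quotient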
In the following program, lines 27--32 do the lower
half of a double-precision addition, using another method for
forcing the appropriate sign as an alternative to the trick
of Program A.
\algbegin Program D (Double-precision division).
This program adheres to the same conventions as Programs A and
M.
{\sl 01}⊗DFDIV⊗STJ⊗EXITDF⊗Double-precision division:\cr
{\sl 20}⊗⊗LDA⊗ARGX(1:4)\cr

\exno 1. [??]\xskip Try the double-precision division technique by hand, with $ε = {1\over
1000}$, when dividing 180000 by 314159. $\biglp$Thus, let $(u↓m,
u↓l) = (.180, .000)$ and $(v↓m, v↓l) = (.314, .159)$, and find
the quotient using the method suggested in the text following
(2).$\bigrp$
\exno 2. [20] Would it
be a good idea to insert the instruction ``ENTX 0'' between
lines 30 and 31 of Program M, in order to keep unwanted information
left over in register X from interfering with the accuracy of
the results?
\exno 3. [M20] Explain
why overflow cannot occur during Program M.
\exno 4. [22] How should Program M be changed so that extra
accuracy is achieved, essentially by moving the vertical line
in Fig.\ 4 over to the right one position? Specify all changes
that are required, and determine the difference in execution
time caused by these changes.
\exno 5. [24] How should Program A be changed so that extra
accuracy is achieved, essentially by working with a nine-byte
accumulator instead of an eight-byte accumulator to the right
of the decimal point? Specify all changes that are required,
and determine the difference in execution time caused by these
changes.
\exno 6. [23] Assume that
the double-precision subroutines of this section and the single-precision
subroutines of Section 4.2.1 are being used in the same main
program. Write a subroutine which converts a single-precision
floating-point number into double-precision form (1), and write
another subroutine which converts a double-precision floating-point
number into single-precision form (or which reports exponent
overflow or underflow if the conversion is impossible).
\exno 7. [M30] Estimate
the accuracy of the double-precision subroutines in this section,
by finding bounds $\delta ↓1, \delta ↓2$, and $\delta ↓3$ on
the relative errors
$$|\biglp (u \oplus v) - (u + v)\bigrp /(u + v)|,\qquad
|\biglp (u \otimes v) - (u \times v)\bigrp /(u \times v)|,\qquad
|\biglp (u \odiv v) - (u/v)\bigrp /(u/v)|.$$
\exno 8. [M28] Estimate the accuracy
of the ``improved'' double-precision subroutines of exercises
4 and 5, in the sense of exercise 7.
%folio 328 galley 17 (C) Addison-Wesley 1978
\quad The difference between exponents behaved approximately
according to the probabilities given in the following table, for various
radices $b$:
$|e↓u - e↓v|$⊗$b = 2$⊗$b = 10$⊗$b = 16$⊗$b = 64$\cr
0⊗0.33⊗0.47⊗0.47⊗0.56\cr
1⊗0.12⊗0.23⊗0.26⊗0.27\cr
2⊗0.09⊗0.11⊗0.10⊗0.04\cr
3⊗0.07⊗0.03⊗0.02⊗0.02\cr
4⊗0.07⊗0.01⊗0.01⊗0.02\cr
5⊗0.04⊗0.01⊗0.02⊗0.00\cr
over 5⊗0.28⊗0.13⊗0.11⊗0.09\cr
average⊗3.1⊗0.9⊗0.8⊗0.5\cr
(The ``over 5'' line includes essentially all
of the cases when one operand is zero, but the ``average'' does
not include these cases.)
When $u$ and $v$ have the same sign and are normalized,
then $u + v$ either requires one shift to the {\sl right} (for
fraction overflow), or no normalization shifts whatever. When
$u$ and $v$ have opposite signs, we have zero or more {\sl left}
shifts during the normalization. The following table gives the observed
number of shifts required:
⊗$b = 2$⊗$b = 10$⊗$b = 16$⊗$b = 64$\cr
Shift right 1⊗0.20⊗0.07⊗0.06⊗0.02\cr
No shift⊗0.59⊗0.80⊗0.82⊗0.87\cr
Shift left 1⊗0.07⊗0.08⊗0.07⊗0.06\cr
Shift left 2⊗0.03⊗0.02⊗0.01⊗0.01\cr
Shift left 3⊗0.02⊗0.00⊗0.01⊗0.00\cr
Shift left 4⊗0.02⊗0.01⊗0.00⊗0.01\cr
Shift left >4⊗0.06⊗0.02⊗0.02⊗0.02\cr
The last line includes all cases where the
result was zero. The average number of left shifts per normalization
was about 0.9 when $b = 2$; 0.2 when $b = 10$ or 16; and 0.1
when $b = 64$.
\subsectionbegin{B. The fraction parts} Further analysis
of floating-point routines can be based on the {\sl statistical
distribution of the fraction parts} of randomly chosen normalized
floating-point numbers. In this case the facts are quite surprising,
and there is an interesting theory which accounts for the unusual
phenomena which are observed.
For convenience let us temporarily assume that
we are dealing with floating-{\sl decimal} (i.e., radix 10)
arithmetic; modifications of the following discussion to any
other positive integer base $b$ will be very straightforward.
Suppose we are given a ``random'' positive normalized number
($e, f) = 10↑e \cdot f$. Since $f$ is normalized, we know that
its leading digit is 1, 2, 3, 4, 5, 6, 7, 8, or 9, and it seems
natural to assume that each of these leading digits will occur
about one-ninth of the time. But, in fact, the behavior in practice
is quite different. For example, the leading digit tends to
be equal to 1 over 30 percent of the time!
One way to test the assertion just made is to take
a table of physical constants (e.g., the speed of light, the
force of gravity) from some standard reference. If we look at
the {\sl Handbook of Mathematical Functions} (U.S. Dept of Commerce,
1964), for example, we find that 8 of the 28 different physical
constants given in Table 2.3, roughly 29 percent, have leading
digit equal to 1. The decimal values of $n!$ for $1 ≤ n ≤ 100$
include exactly 30 entries beginning with 1; so do the decimal
values of 2$↑|εn$ and of $F↓n, for 1 ≤ n ≤ 100$. We might also
try looking at census reports, or a Farmer's Almanack, etc.\n(bbbbbb???but
not a telephone directory).
In the days before pocket calculators, the pages
in well-used tables of logarithms tended to get quite dirty
in the front, while the last pages stayed relatively clean and
neat. This phenomenon was apparently first mentioned by the
American astronomer Simon Newcomb [{\sl Amer.\ J. Math. \bf 4}
(1881), 39--40], who gave good grounds for believing that the
leading digit $d$ occurs with probability log$↓{10}(1 + 1/d)$.
The same distribution was discovered empirically, many years
later, by Frank Benford, who reported the results of 20,229
observations taken from different sources [{\sl Proc.\ Amer.\
Philosophical Soc.\ \bf 78} (1938), 551--572].
In order to account for this leading-digit law,
let's take a closer look at the way we write numbers in floating-point
notation. If we take any positive number $u$, its leading digits
are determined by the value $(\log↓{10} u)\mod 1$: The leading
digit is less than $d$ if and only if
$$(\log↓{10} u)\mod 1 < \log↓{10} d,\eqno (1)$$
since $10f↓u = 10↑{(\log↓{10} u)\mod 1}$.
Now if we have any ``random'' positive number $U$,
chosen from some reasonable distribution such as might occur
in nature, we might expect that $(\log↓{10} U)\mod 1$ would be
uniformly distributed between zero and one, at least to a very
good approximation. (Similarly, we expect $U\mod 1$, $U↑2\mod
1$, $\sqrt{U + π}\mod 1$, etc., to be uniformly distributed.
We expect a roulette wheel to be unbiased, for essentially the
same reason.) Therefore by (1) the leading digit will be 1 with
probability $\log↓{10} 2 \approx 30.103$ percent; it will be 2
with probability $\log↓{10} 3 - \log↓{10} 2 \approx 17.609$ percent;
and, in general, if $r$ is any real value between 1 and 10, we
ought to have $10f↓U ≤ r$ approximately $\log↓{10} r$ of the
time.
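This prediction is easy to test empirically. The following Python sketch tallies the leading digits of $2↑n$ for $1 ≤ n ≤ 100$ and compares the observed frequencies with $\log↓{10}(1 + 1/d)$; the same experiment can be run on $n!$ or on the Fibonacci numbers:

from math import log10

def leading_digit(n):
    return int(str(n)[0])

counts = [0] * 10
for n in range(1, 101):
    counts[leading_digit(2 ** n)] += 1       # digit 1 occurs 30 times, as noted above

for d in range(1, 10):
    print(d, counts[d] / 100, round(log10(1 + 1 / d), 3))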
Another way to explain this law is to say that
a random value $U$ should appear at a random point on a slide
rule, according to the uniform distribution, since the distance
from the left end of a slide rule to the position of $U$ is
proportional to (log$↓{10} U)$mod 1. The analogy between slide
rules and floating-point calculation is very close when multiplication
and division are being considered.
The fact that leading digits tend to be small is
important to keep in mind; it makes the most obvious techniques
of ``average error'' estimation for floating-point calculations
invalid. The relative error due to rounding is usually a little
more than expected.
Of course, it may justly be said that the heuristic
argument above does not prove the stated law. It merely shows
us a plausible reason why the leading digits behave the way
they do. An interesting approach to the analysis of leading
digits has been suggested by R. Hamming: Let $p(r)$ be the probability
that $10f↓U ≤ r$, where $1 ≤ r ≤ 10$ and $f↓U$ is the normalized
fraction part of a random normalized floating-point number $U$.
If we think of random quantities in the real world, we observe
that they are measured in terms of arbitrary units; and if we
were to change the definition of a meter or a gram, many of
the fundamental physical constants would have different values.
Suppose then that all of the numbers in the universe are suddenly
multiplied by a constant factor $c$; our universe of random
floating-point quantities should be essentially unchanged by
this transformation, so the distribution of leading digits should
not be affected.
%folio 330 galley 18 (C) Addison-Wesley 1978
\quad Multiplying everything by $c$ changes $(\log↓{10} U)\mod
1$ to $(\log↓{10} U + \log↓{10} c)\mod 1$. It is now time to set
up the formulas which describe the desired behavior; we may
assume that $1 ≤ c ≤ 10$. By definition,
$$p(r) = \hbox{probability}\biglp (\log↓{10} U)\mod 1 ≤ \log↓{10} r\bigrp .$$
By our assumption, we should also have
$$p(r) = \hbox{probability}\biglp (\log↓{10} U + \log↓{10} c)\mod 1 ≤ \log↓{10} r\bigrp$$
$$= \hbox{probability}\biglp (\log↓{10} U)\mod 1 ≤ \log↓{10} r - \log↓{10} c\ \hbox{ or }\ (\log↓{10} U)\mod 1 ≥ 1 - \log↓{10} c\bigrp
= p(r/c) + 1 - p(10/c),\qquad \hbox{if}\quad c ≤ r;$$
$$= \hbox{probability}\biglp 1 - \log↓{10} c ≤ (\log↓{10} U)\mod 1 ≤ 1 + \log↓{10} r - \log↓{10} c\bigrp
= p(10r/c) - p(10/c),\qquad \hbox{if}\quad c ≥ r.\eqno (2)$$
Let us now extend the function $p(r)$ to values outside
the range $1 ≤ r ≤ 10$, by defining $p(10↑nr) = p(r) + n$; then
if we replace $10/c$ by $d$, the last equation of (2) may be
written
$$p(rd) = p(r) + p(d).\eqno (3)$$
|qctr \9skip |newtype If our assumption about invariance of
the distribution under multiplication by a constant factor is
valid, then Eq.\ (3) must hold for all $r > 0$ and $1 ≤ d ≤ 10$.
The facts that $p(1) = 0$, $p(10) = 1$ now imply that
$$1 = p(10) = p\biglp (10↑{1/n})↑n\bigrp = p(10↑{1/n})
+ p\biglp (10↑{1/n})↑{n-1}\bigrp = \cdots = np(10↑{1/n});$$
hence we deduce that $p(10↑{m/n}) = m/n$ for all
positive integers $m$ and $n$. If we now decide to require that
$p$ is continuous, we are forced to conclude that $p(r) = \log↓{10}
r$, and this is the desired law.
|qleft |newtype \quad Although this argument may be more convincing
than the first one, it doesn't really hold up under scrutiny
if we stick to conventional notions of probability. The traditional
way to make the above argument rigorous is to assume that there
is some underlying distribution of numbers $F(u)$ such that
a given positive number $U$ is ≤$u$ with probability $F(u)$;
and that
$$p(r) = \sum ↓{m} \biglp F(10↑mr) - F(10↑m)\bigrp ,\eqno (4)$$
summed over all values $-∞ < m < ∞$. Our assumptions
about scale invariance and continuity led to the conclusion
that
$$p(r) = \log↓{10} r.$$
Using the same argument, we could ``prove'' that
$$\sum ↓{m} \biglp F(b↑mr) - F(b↑m)\bigrp = \log↓b r,\eqno (5)$$
for each integer $b ≥ 2$, when $1 ≤ r
≤ b$. But there {\sl is} no distribution function $F$ which satisfies
this equation for all such $b$ and $r$.
One way out of the difficulty is to regard the
logarithm law $p(r) =$ log$↓{10} r$ as only a very close {\sl
approximation} to the true distribution. The true distribution
itself may perhaps be changing as the universe expands, becoming
a better and better approximation as time goes on; and if we
replace 10 by an arbitrary base $b$, the approximation appears
to be less accurate (at any given time) as $b$ gets larger.
Another rather appealing way to resolve the dilemma, by abandoning
the traditional idea of a distribution function, has been suggested
by R. A. Raimi, {\sl AMM \bf 76} (1969), 342--348.
The hedging in the last paragraph is probably a
very unsatisfactory explanation, and so the following further
calculation (which sticks to rigorous mathematics and avoids
any intuitive, yet paradoxical, notions of probability) should
be welcome. Let us consider the distribution of the leading
digits of the {\sl positive integers,} instead of the distribution
for some imagined set of real numbers. The investigation of
this topic is quite interesting, not only because it sheds some
light on the probability distributions of floating-point data,
but also because it makes a particularly instructive example
of how to combine the methods of discrete mathematics with the
methods of infinitesimal calculus.
In the following discussion, let $r$ be a fixed
real number, $1 ≤ r ≤ 10$; we will attempt to give a reasonable
definition of $p(r)$, the ``probability'' that the representation
$10↑{e↓N}\cdot f↓N$ of a ``random'' positive integer $N$
has $10f↓N < r$, assuming infinite precision.
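In concrete terms, this decomposition is easy to compute; the following tiny Python fragment (an illustrative sketch, with a helper name of our own choosing) produces $(e↓N, f↓N)$ for a given $N$.

def exponent_and_fraction(n):
    # Write the positive integer n as 10**e * f with 1/10 <= f < 1.
    e = len(str(n))              # n has e decimal digits, so 10**(e-1) <= n < 10**e
    return e, n / 10 ** e

print(exponent_and_fraction(2718))   # (4, 0.2718); the leading part 10*f is 2.718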
To start, let us try to find the probability using
a limiting method like the definition of ``Pr'' in Section 3.5.
One nice way to rephrase that definition is to define
$$P↓0(n) = \cases{1,&if $n = 10↑e\cdot f$ where $10f < r$, i.e., if $(\log↓{10} n)\mod 1 < \log↓{10} r$;\cr
0,&otherwise.\cr}\eqno(6)$$
Now $P↓0(1), P↓0(2), \ldots$ is an infinite sequence of
zeros and ones, with ones to represent the cases which contribute
to the probability. We can try to ``average out'' this sequence,
by defining
$$P↓1(n) = {1\over n}\sum↓{1≤k≤n} P↓0(k).\eqno(7)$$
Thus if we generate a random integer between 1
and $n$ using the techniques of Chapter 3, and convert it to
floating-decimal form $(e, f)$, the probability that $10f <
r$ is exactly $P↓1(n)$. It is natural to let $\lim↓{n→∞} P↓1(n)$
be the ``probability'' $p(r)$ we are after, and that is just
what we did in Section 3.5.
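These quantities are easy to tabulate; the short Python sketch below (an informal illustration whose names P0 and P1 simply mirror the notation of (6) and (7)) computes $P↓1(n)$ for two subsequences of values of $n$, and already hints at the trouble described next.

def P0(n, r):
    # 1 if n = 10**e * f with 10*f < r, i.e., (log10 n) mod 1 < log10 r; else 0.  Cf. (6).
    e = len(str(n))                      # 10**(e-1) <= n < 10**e
    return 1 if n < r * 10 ** (e - 1) else 0

def P1(n, r):
    # (P0(1) + ... + P0(n)) / n, as in (7).
    return sum(P0(k, r) for k in range(1, n + 1)) / n

for n in (10, 100, 1000, 10000, 100000):
    print(n, P1(n, 2))       # 0.2, 0.12, 0.1112, ...; drifting down toward 1/9
for n in (20, 200, 2000, 20000, 200000):
    print(n, P1(n, 2))       # 0.55, 0.555, 0.5555, ...; staying near 5/9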
But in this case the limit does not exist: For
example, let us consider the subsequence
$$P↓1(s), P↓1(10s), P↓1(100s), \ldots, P↓1(10↑ns), \ldots,$$
where $s$ is a real number, $1 ≤ s ≤ 10$. If $s ≤ r$, we find
that
$$\eqalign{P↓1(10↑ns)&={1\over 10↑ns}\biglp\lceil r\rceil - 1 + \lceil 10r\rceil - 10 +\cdots+ \lceil 10↑{n-1}r\rceil - 10↑{n-1} + \lfloor 10↑ns\rfloor + 1 - 10↑n\bigrp\cr
&={1\over 10↑ns}\biglp r(1 + 10 +\cdots+ 10↑{n-1}) + O(n) + \lfloor 10↑ns\rfloor - 1 - 10 -\cdots- 10↑n\bigrp\cr
&={1\over 10↑ns}\biglp{\textstyle{1\over 9}}(10↑nr - 10↑{n+1}) + \lfloor 10↑ns\rfloor + O(n)\bigrp,\cr}\eqno(8)$$
where $r = r↓0.r↓1r↓2\ldots$ in decimal notation.
As $n → ∞, P↓1(10↑ns)$ therefore approaches the limiting value
$1 + (r - 10)/9s$. The above calculation for the case $s ≤ r$
can be modified so that it is valid for $s > r$ if we replace
$\lfloor 10↑ns\rfloor + 1$ by $\lceil 10↑nr\rceil$; so when
$s ≥ r$, we find that the limiting value is $10(r - 1)/9s$. [See
J. Franel, {\sl Naturforschende Gesellschaft, Vierteljahrsschrift
\bf 62} (Z\"urich, 1917), 286--295.]
Therefore $P↓1(n)$ has subsequences whose
limit goes from $(r - 1)/9$ up to $10(r - 1)/9r$ and down again
to $(r - 1)/9$, as $s$ goes from 1 to $r$ to 10. We see that
$P↓1(n)$ has no limit, and $P↓1(n)$ is not a particularly good
approximation to our conjectured answer $\log↓{10} r$ either!
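These limiting values can be checked empirically; the following Python sketch (our own illustration, with $P↓1$ computed by brute force) compares $P↓1(10↑5s)$ with the limits just derived, for $r = 2$.

def P1(n, r):
    # (1/n) * #{ 1 <= k <= n : k = 10**e * f with 10*f < r }, as in (7).
    return sum(1 for k in range(1, n + 1)
               if k < r * 10 ** (len(str(k)) - 1)) / n

def P1_limit(r, s):
    # The limiting value of P1(10**n * s) derived in the text.
    return 1 + (r - 10) / (9 * s) if s <= r else 10 * (r - 1) / (9 * s)

r = 2
for s in (1.5, 3.0, 7.5):
    print(s, P1(int(s * 10 ** 5), r), P1_limit(r, s))   # the two columns agree to several places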
Since $P↓1(n)$ doesn't approach a limit, we can
try to use the same idea as (7) once again, to ``average out''
this anomalous behavior. In general, let
$$P↓{m+1}(n) = {1\over n}\sum↓{1≤k≤n} P↓m(k).\eqno(9)$$
Then $P↓{m+1}(n)$ will tend to be a more
well-behaved sequence than $P↓m(n)$. Let us try to examine the
behavior of $P↓{m+1}(n)$, for large $n$. Our experience above
with the special case $m = 0$ indicates that it might be worth
while to consider the subsequence $P↓{m+1}(10↑ns)$; and the
following results can now be derived:
\thbegin Lemma Q. {\sl For any integer $m ≥ 1$ and any real
number $ε > 0$, there are functions $Q↓m(s)$, $R↓m(s)$ and an integer
$N↓m(ε)$, such that whenever $n > N↓m(ε)$ and $1 ≤ s ≤ 10$, we have}
$$\eqalign{\left|P↓m(10↑ns) - Q↓m(s)\right| &< ε,\qquad\hbox{if}\quad s ≤ r;\cr
\left|P↓m(10↑ns) - \biglp Q↓m(s) + R↓m(s)\bigrp\right| &< ε,\qquad\hbox{if}\quad s > r.\cr}\eqno(10)$$
Furthermore the functions $Q↓m(s)$, $R↓m(s)$ satisfy the relations
$$\eqalign{Q↓m(s)&={1\over s}\left({1\over 9}\int↑{10}↓{1}Q↓{m-1}(t)\,dt + \int↑{s}↓{1}Q↓{m-1}(t)\,dt + {1\over 9}\int↑{10}↓{r}R↓{m-1}(t)\,dt\right);\cr
R↓m(s)&={1\over s}\int↑{s}↓{r}R↓{m-1}(t)\,dt;\cr
Q↓0(s)&=1,\qquad R↓0(s) = -1.\cr}\eqno(11)$$
{\sl Proof.}\xskip Consider the functions $Q↓m(s)$, $R↓m(s)$
defined by (11), and let
$$S↓m(t) = \cases{Q↓m(t),&if $t ≤ r$;\cr
Q↓m(t) + R↓m(t),&if $t > r$.\cr}$$
We will prove the lemma by induction on $m$.
First, let $m = 1$; then $Q↓1(s) = \biglp
1 + (s - 1) + (r - 10)/9\bigrp/s = 1 + (r - 10)/9s$, and $R↓1(s)
= (r - s)/s$. From (8) we find that
$$|P↓1(10↑ns) - S↓1(s)| = O(n)/10↑n;$$
this establishes the lemma when $m = 1$.
Now for $m > 1$, we have
$$P↓m(10↑ns) = {1\over s}\left(\sum↓{0≤j<n}{1\over 10↑{n-j}}\sum↓{10↑j≤k<10↑{j+1}}{1\over 10↑j}P↓{m-1}(k) + \sum↓{10↑n≤k≤10↑ns}{1\over 10↑n}P↓{m-1}(k)\right),$$
and we want to approximate this quantity. By induction
the difference
$$\left|\,\sum↓{10↑j≤k≤10↑jq}{1\over 10↑j}P↓{m-1}(k) - \sum↓{10↑j≤k≤10↑jq}{1\over 10↑j}S↓{m-1}\left({k\over 10↑j}\right)\right|\eqno(13)$$
is less than $qε$ when $1 ≤ q ≤ 10$ and $j > N↓{m-1}(ε)$;
and since $S↓{m-1}(t)$ is continuous, it is a Riemann-integrable
function, and the difference
$$\left|\,\sum↓{10↑j≤k≤10↑jq}{1\over 10↑j}S↓{m-1}\left({k\over 10↑j}\right) - \int↑{q}↓{1}S↓{m-1}(t)\,dt\,\right|\eqno(14)$$
is less than $ε$ for all $j$ greater than some
number $N$, independent of $q$, by the definition of integration.
We may choose $N$ to be greater than $N↓{m-1}(ε)$. Therefore for $n > N$,
the difference
$$\left|\,P↓m(10↑ns) - {1\over s}\left(\sum↓{0≤j<n}{1\over 10↑{n-j}}\int↑{10}↓{1}S↓{m-1}(t)\,dt + \int↑{s}↓{1}S↓{m-1}(t)\,dt\right)\right|\eqno(15)$$
is bounded by
$$\sum↓{0≤j≤N}{M\over 10↑{n-j}} + \sum↓{N<j<n}{11ε\over 10↑{n-j}} + 11ε,$$
if $M$ is an upper bound for (13) $+$ (14) which
is valid for all $j$. Finally, the sum $\sum↓{0≤j<n}(1/10↑{n-j})$,
which appears in (15), is equal to $(1 - 1/10↑n)/9$; so
$$\left|\,P↓m(10↑ns) - {1\over s}\left({1\over 9}\int↑{10}↓{1}S↓{m-1}(t)\,dt + \int↑{s}↓{1}S↓{m-1}(t)\,dt\right)\right|$$
can be made smaller than, say, $20ε$,
if $n$ is taken large enough. Comparing this with (10) and (11)
completes the proof.
The gist of Lemma Q is that we have the
limiting relationship
$$\lim↓{n→∞} P↓m(10↑ns) = S↓m(s).\eqno(16)$$
Also, since $S↓m(s)$ is not constant as $s$ varies, the limit
$$\lim↓{n→∞} P↓m(n),$$
which would be our desired ``probability,'' does
not exist for any $m$. The situation is shown in Fig.\ 5, which
shows the values of $S↓m(s)$ when $m$ is small and $r = 2$.
Even though $S↓m(s)$ is not a constant, so that
we do not have a definite limit for $P↓m(n)$, note that already
for $m = 3$ in Fig.\ 5 the value of $S↓m(s)$ stays very close
to $\log↓{10} 2 = 0.30103 \ldotsm\,$. Therefore we have good reason
to suspect that $S↓m(s)$ is very close to $\log↓{10} r$ for all
large $m$, and, in fact, that the sequence of functions $\langle
S↓m(s)\rangle$ converges uniformly to the constant function
$\log↓{10} r$.
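The behavior shown in Fig.\ 5 can be reproduced numerically. The following Python sketch (a rough illustration; the grid size and the trapezoidal integration rule are arbitrary choices of ours) iterates the recurrence (11) with $r = 2$ and prints the extreme values of $S↓m(s)$ over $1 ≤ s ≤ 10$.

import math

r, N = 2.0, 900                                 # N grid intervals; chosen so that s = r is a grid point
ts = [1.0 + 9.0 * i / N for i in range(N + 1)]  # grid over 1 <= s <= 10
i_r = round((r - 1.0) * N / 9.0)                # index of the grid point s = r

def prefix_integrals(f):
    # cum[i] = trapezoid-rule approximation to the integral of f from ts[0] to ts[i].
    cum = [0.0] * (N + 1)
    for i in range(N):
        cum[i + 1] = cum[i] + 0.5 * (f[i] + f[i + 1]) * (ts[i + 1] - ts[i])
    return cum

Q = [1.0] * (N + 1)                             # Q_0(s) = 1
R = [-1.0] * (N + 1)                            # R_0(s) = -1
for m in (1, 2, 3):
    cq, cr = prefix_integrals(Q), prefix_integrals(R)
    const = (cq[N] + cr[N] - cr[i_r]) / 9.0     # (1/9)(int_1^10 Q + int_r^10 R)
    Q = [(const + cq[i]) / ts[i] for i in range(N + 1)]
    R = [(cr[i] - cr[i_r]) / ts[i] for i in range(N + 1)]
    S = [Q[i] + (R[i] if ts[i] > r else 0.0) for i in range(N + 1)]
    print(m, min(S), max(S))     # by m = 3 both extremes are within a few hundredths of log10 2

print(math.log10(r))             # 0.30102999...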
It is interesting to prove this conjecture by explicitly
calculating $Q↓m(s)$ and $R↓m(s)$ for all $m$, as in
the proof of the following theorem:
\thbegin Theorem F. {\sl Let $S↓m(s)$ be the limit defined
in $(16)$. For all $ε > 0$, there exists a number $N(ε)$ such that}
$$|S↓m(s) - \log↓{10} r| < ε\qquad\hbox{for}\qquad 1 ≤ s ≤ 10,\eqno(17)$$
whenever $m > N(ε)$.
{\bf Fig.\ 5.}\xskip The probability that the leading digit is 1.
{\sl Proof.}\xskip In view of Lemma Q, we can
prove this result if we can show there is a number $M$ depending
on $ε$ such that, for $1 ≤ s ≤ 10$,
$$|Q↓m(s) - \log↓{10} r| < ε\qquad\hbox{and}\qquad |R↓m(s)| < ε\eqno(18)$$
for all $m > M$.
It is not difficult to solve the recurrence formula
(11) for $R↓m$: We have $R↓0(s) = -1$, $R↓1(s) = -1 + r/s$, $R↓2(s)
= -1 + (r/s)\biglp 1 + \ln (s/r)\bigrp $, and in general
$$R↓m(s) = -1 + {r\over s}\left(1 + {1\over 1!}\ln\left({s\over r}\right) +\cdots+ {1\over(m-1)!}\left(\ln\left({s\over r}\right)\right)↑{m-1}\right).\eqno(19)$$
For the stated range of $s$, this converges uniformly to
$$-1 + (r/s)\exp\biglp\ln(s/r)\bigrp = 0.$$
The recurrence (11) for $Q↓m$ takes the form
$$Q↓m(s) = {1\over s}\left(c↓m + 1 + \int↑{s}↓{1}Q↓{m-1}(t)\,dt\right),\eqno(20)$$
where
$$c↓m = {1\over 9}\left(\int↑{10}↓{1}Q↓{m-1}(t)\,dt + \int↑{10}↓{r}R↓{m-1}(t)\,dt\right) - 1.\eqno(21)$$
The solution to recurrence (20) is easily found
by trying out the first few cases and guessing at a formula
which can be proved by induction; we find that
$$Q↓m(s) = 1 + {1\over s}\left(c↓m + {1\over 1!}c↓{m-1}\ln s +\cdots+ {1\over(m-1)!}c↓1(\ln s)↑{m-1}\right).\eqno(22)$$
It remains for us to calculate the coefficients
$c↓m$, which by (19), (21), and (22) satisfy the relations
$$c↓1 = (r - 10)/9,$$
$$c↓{m+1} = {1\over 9}\left(c↓m\ln 10 + {1\over 2!}c↓{m-1}(\ln 10)↑2 +\cdots+ {1\over m!}c↓1(\ln 10)↑m + r\left(1 + {1\over 1!}\ln{10\over r} +\cdots+ {1\over m!}\left(\ln{10\over r}\right)↑m\right) - 10\right).\eqno(23)$$
This sequence appears to be very complicated,
but actually we can analyze it without difficulty with the help
of generating functions. Let
$$C(z) = c↓1z + c↓2z↑2 + c↓3z↑3 +\cdots;$$
then since $10↑z = 1 + z\ln 10 + {1\over 2!}(z\ln 10)↑2 +\cdots$, we deduce that
$$\eqalign{c↓{m+1}&={1\over 10}c↓{m+1} + {9\over 10}c↓{m+1}\cr
&={1\over 10}\left(c↓{m+1} + c↓m\ln 10 +\cdots+ {1\over m!}c↓1(\ln 10)↑m\right) + {r\over 10}\left(1 +\cdots+ {1\over m!}\left(\ln{10\over r}\right)↑m\right) - 1\cr}$$
is the coefficient of $z↑{m+1}$ in the function
$${1\over 10}C(z)10↑z + {r\over 10}\left({10\over r}\right)↑z\left({z\over 1-z}\right) - {z\over 1-z}.\eqno(24)$$
This condition holds for all values of
$m$, so (24) must equal $C(z)$, and we obtain the explicit formula
$$C(z) = {-z\over 1-z}\left({(10/r)↑{z-1} - 1\over 10↑{z-1} - 1}\right).\eqno(25)$$
We want to study asymptotic properties of
the coefficients of $C(z)$, to complete our analysis. The large
parenthesized quantity in (25) approaches $\ln(10/r)/\ln 10
= 1 - \log↓{10} r$ as $z → 1$, so we see that
$$C(z) + {1 - \log↓{10} r\over 1 - z} = R(z)\eqno(26)$$
is an analytic function of the complex variable
$z$ in the circle
$$|z| < \left|1 + {2πi\over\ln 10}\right|.$$
In particular, the power series for $R(z)$ converges for $z = 1$, so
its coefficients approach zero. This proves that the coefficients
of $C(z)$ behave like those of $(\log↓{10} r - 1)/(1 - z)$, that
is,
$$\lim↓{m→∞} c↓m = \log↓{10} r - 1.$$
Finally, we may combine this with (22), to
show that $Q↓m(s)$ approaches
$$1 + {\log↓{10} r - 1\over s}\left(1 + \ln s + {1\over 2!}(\ln s)↑2 +\cdots\right) = \log↓{10} r$$
uniformly for $1 ≤ s ≤ 10$.
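As a purely numerical illustration of these limits (our own sketch, separate from the proof), the coefficients $c↓m$ can be computed directly from (23) and substituted into (22); with $r = 2$ the values settle down quickly to $\log↓{10}2 - 1$ and $\log↓{10}2$.

import math

r = 2.0
L10, LR = math.log(10.0), math.log(10.0 / r)
c = [None, (r - 10.0) / 9.0]                      # c[1] = (r - 10)/9
for m in range(1, 40):                            # compute c[2], ..., c[40] from (23)
    total = sum(c[m + 1 - j] * L10 ** j / math.factorial(j) for j in range(1, m + 1))
    total += r * sum(LR ** j / math.factorial(j) for j in range(m + 1)) - 10.0
    c.append(total / 9.0)

def Qm(m, s):
    # Q_m(s) according to the explicit solution (22).
    return 1.0 + sum(c[m - j] * math.log(s) ** j / math.factorial(j)
                     for j in range(m)) / s

print(c[40], math.log10(r) - 1.0)                 # both near -0.69897
print(Qm(40, 3.7), math.log10(r))                 # both near  0.30103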
The above proofs of Lemma Q and Theorem F are slight
simplifications and amplifications of methods due to B. J. Flehinger,
{\sl AMM \bf 73} (1966), 1056--1061. Many authors have written
about the distribution of initial digits, showing that the logarithm
law is a good approximation for many underlying distributions;
see the survey by Ralph A. Raimi, {\sl AMM \bf 83} (1976), to
appear, for a comprehensive review of the literature. An appealing
new explanation of the phenomenon as applied to integers has
been proposed by D. I. A. Cohen, {\sl J. Comb.\ Theory} (A) {\bf 20}
(1976), to appear. Another interesting (and different) treatment
of floating-point distribution has been given by Alan G. Konheim,
{\sl Math.\ Comp.\ \bf 19} (1965), 143--144. For an approach to
the definition of probability under which the logarithm law
holds exactly over the integers, see exercise 17.
Therefore we have established the logarithmic law
for integers by direct calculation, at the same time seeing
that it is an extremely good approximation to the average behavior
although it is never precisely achieved.
|qleft \24skip {\:r EXERCISES
\exno 1. [13] Given that $u$
and $v$ are nonzero floating-decimal numbers {\sl with the same
sign,} what is the approximate probability that fraction overflow
occurs during the calculation of $u \oplus v$, according to
Sweeney's tables?
\exno 2. [42] Make further tests of floating-point addition
and subtraction, to confirm or improve on the accuracy of Sweeney's
tables.
\exno 3. [15] What is the probability that the two leading digits
of a floating-decimal number are ``23'', according to the logarithm
law?
\exno 4. [18] The text points out
that the front pages of a well-used table of logarithms get
dirtier than the back pages do. What if we had an {\sl antilogarithm}
table instead, i.e., a table giving the value of $x$ when $\log↓{10} x$
is given? Which pages of such a table would be the dirtiest?
\exno 5. [M20] Suppose that $U$ is a real number uniformly distributed
in the interval $0 < U < 1$. What is the distribution of the
leading digits of $U?$
\exno 6. [23] If we have binary computer
words containing $n + 1$ bits, we might use $p$ bits for the
fraction part of floating-binary numbers, one bit for the sign,
and $n - p$ bits for the exponent. This means that the range
of values representable, i.e., the ratio of the largest to the
smallest positive normalized value, is essentially $2↑{2↑{n-p}}$.
The same computer word could be used to represent floating-{\sl
hexadecimal} numbers, i.e., floating-point numbers with radix
16, with $p + 2$ bits for the fraction part \biglp $(p + 2)/4$
hexadecimal digits\bigrp\ and $n - p - 2$ bits for the exponent;
then the range of values would be $16↑{2↑{n-p-2}} = 2↑{2↑{n-p}}$,
the same as before, and with more bits
in the fraction part. This may sound as if we are getting something
for nothing, but the normalization condition for base 16 is weaker
in that there may be up to three leading zero bits in the fraction
part; thus not all of the $p + 2$ bits are ``significant
bits.''
|qleft \qquad On the basis of the logarithmic law, what are
the probabilities that the fraction part of a positive normalized
radix 16 floating-point number has exactly 0, 1, 2, and 3 leading
zero bits? Discuss the desirability of hexadecimal versus binary.
\exno 7. [HM28] Prove that
there is no distribution function $F(u)$ which satisfies (5)
for each integer $b ≥ 2$, and for all real values $r$ in the
range $1 ≤ r ≤ b$.
\exno 8. [M23] Does (10) hold when $m
= 0$ for suitable $N↓0(ε)?$
\exno 9. [M24] (P. Diaconis.)\xskip
Prove that $\lim↓{m→∞} P↓m(n) = P↓0(1)$ for all fixed $n$.
\exno 10. [HM28] The text shows that $c↓m = \log↓{10} r - 1
+ ε↓m$, where $ε↓m$ approaches zero as $m→∞$. Obtain the next
term in the asymptotic expansion of $c↓m$.
\exno 11. [M15] Show that if $U$ is a
random variable which is distributed according to the logarithmic
law, then so is $1/U$.
\exno 12. [M25] (R. W. Hamming.)\xskip The purpose of this exercise
is to show that the result of floating-point multiplication
tends to obey the logarithmic law more perfectly than the operands
do. Let $U$ and $V$ be random, normalized, positive floating-point
numbers, whose fraction parts are independently distributed
with the respective density functions $f(x)$ and $g(x)$. Thus,
$f↓u ≤ r$ and $f↓v ≤ s$ with probability $\int ↑{r}↓{1/b}\int
↑{s}↓{1/b}f(x)g(y) dy dx$, for $1/b ≤ r, s ≤ 1$. Let $h(x)$
be the density function of the fraction part of $U \times V$
(unrounded). Define the {\sl abnormality A(f)} of a density
function $f$ as the maximum relative error,
$$A(f) = \max↓{1/b≤x≤1}\left|{f(x) - l(x)\over l(x)}\right|,$$
where $l(x) = 1/(x\ln b)$ is the density of the
logarithmic distribution. Prove that $A(h) ≤ \min\biglp A(f),
A(g)\bigrp$. [In particular, if either factor has logarithmic
distribution the product does also.]
\exno 13. [M20] The floating-point multiplication routine, Algorithm
4.2.1M, requires zero or one left shifts during normalization,
depending on whether $f↓uf↓v ≥ 1/b$ or not. Assuming that the
input operands are independently distributed according to the
logarithmic law, what is the probability that no left shift
is needed for normalization of the result?
\exno 14. [HM30] Let $U$ and $V$ be random, normalized, positive
floating-point numbers whose fraction parts are independently
distributed according to the logarithmic law, and let $p↓k$
be the probability that the difference in their exponents is
$k$. Assuming that the distribution of the exponents is independent
of the fraction parts, give an equation for the probability
that ``fraction overflow'' occurs during the floating-point
addition of $U \oplus V$, in terms of the base $b$ and the quantities
$p↓0$, $p↓1$, $p↓2$, $\ldotss\,$. Compare this result with exercise
1. (Ignore rounding.)
\exno 15. [HM28] Let $U$, $V$, $p↓0$, $p↓1$, $\ldots$ be as in exercise
14, and assume that radix 10 arithmetic is being used. Show
that regardless of the values of $p↓0$, $p↓1$, $p↓2$, \ldotss $, the
sum $U \oplus V$ will {\sl not} obey the logarithmic law exactly,
and in fact the probability that $U \oplus V$ has leading digit
1 is always strictly {\sl less} than log$↓{10} 2$.
\exno 16. [HM28] (P. Diaconis.)\xskip Let $P↓0(n)$ be 0 or 1 for each
$n$, and define ``probabilities'' $P↓{m+1}(n)$ by repeated averaging,
as in (9). Show that if $\lim↓{n→∞} P↓1(n)$ does not exist, neither
does $\lim↓{n→∞} P↓m(n)$ for any $m$.\xskip [{\sl Hint:} Prove
that $(a↓1 +\cdots+ a↓n)/n → 0$ and $a↓{n+1} ≤ a↓n +
M/n$ for some fixed constant $M > 0$ implies that $a↓n → 0$.]
\exno 17. [M25] (R. L. Duncan.)\xskip Another way to define $\Pr\biglp
S(n)\bigrp$ is to evaluate $\lim↓{n→∞}\biglp\bigl(\sum↓{S(k)\ {\rm and}\ 1≤k≤n}1/k\bigr)/H↓n\bigrp$,
since it can be shown that this
``harmonic probability'' exists and equals $\Pr\biglp S(n)\bigrp$
whenever the latter exists according to the definition in Section
3.5. Show that the logarithmic law of leading digits holds exactly
for the positive integers with respect to this ``harmonic probability.''